HDL simulators
Grading

* Homework
* Assignments
* Quizzes
* 2 Midterm Exams
* Final Exam


Course Syllabus
* (CH 1) Computer Abstractions and Technology.
* (CH 2) Describe the instruction set architecture of a MIPS processor. Analyze, write, and test MIPS assembly language programs.
* (CH 3) Describe the organization/operation of integer & floating-point units.
* (CH 4) Design the datapath and control of single-cycle, multi-cycle, and pipelined CPUs, & handle hazards.
* (CH 5) Describe the organization/operation of cache memory.
What is the study of Computer Architecture?

What is Computer Architecture?


Architecture
* abstraction of the hardware for the programmer
* instruction set architecture
  * instructions:
    * operations
    * operands, addressing the operands
    * how instructions are encoded
  * storage locations for data
    * registers: how many & what they are used for
    * memory: its size & how it is accessed
  * I/O devices & how to access them
* software conventions:
  * subroutine calls: who saves the registers, which ones are saved
  * passing parameters: in registers? on the stack?
* the interface between the software & hardware


* Architecture = interface between hardware and software 


[Figure: layered view of a computer, from top to bottom: software; interface between HW & SW; memory; gates & transistors]
What is Computer Organization?

Organization or Microarchitecture
* basic components of a computer
  * on the CPU (ALU, registers, PC, etc.)
  * memory (levels of the cache hierarchy)
* how they operate
* how they are connected together

Organization is mostly invisible to the programmer
* today some components are considered part of the architecture
* why? because a programmer can get better performance if he/she knows the structure
* for example: the caches, the pipeline structure


System Organization

[Figure: system organization, showing a disk controller, a graphics controller, and a network interface]
Instruction Set Architecture (ISA)

* Computers are complex, built in layers
  * Several software layers: assembler, compiler, OS, applications
  * Instruction set architecture (ISA)
  * Several hardware layers: transistors, gates, CPU/memory/I/O

* Build the computer bottom up by raising the level of abstraction
  * Solid-state semiconductor materials -> transistors
  * Transistors -> gates
  * Gates -> digital logic elements: latches, muxes, adders
  * Key insight: number representation
  * Logic elements -> datapath + control = processor
What is a Computer?

* A computer is a machine that can solve problems for people by carrying out instructions given to it.
* The sequence of instructions is called a program.

The following block diagram describes the basic architecture of a digital computer:

[Figure: block diagram of a digital computer, showing the Central Processing Unit (CPU) and its Arithmetic/Logic Unit (ALU)]

Applications

* Automatic teller machines
* Computers in automobiles
* Laptop computers
* Human genome project
* World Wide Web
Differences Between Computers

* We have different computers for different purposes.
* Some can achieve the performance needed for high-performance gaming.
  o E.g., the Cell processor in the PlayStation 3.
* Others can achieve decent enough performance for a laptop without using too much power.
  o E.g., Intel Pentium M (for Mobile).
* Some are cheap enough for your DVD player.
* And yet others can function reliably enough to be trusted with the control of your car's brakes.

Price/Performance Pyramid

[Figure: price/performance pyramid, with price tiers of $Millions, $100s K, and $10s K (servers) from top to bottom]

Example Machine Organization

* Workstation design target
  * 25% of cost on processor
  * 25% of cost on memory (minimum memory size)
  * Rest on I/O devices, power supplies, box

[Figure: a computer made up of the CPU (control and datapath) and devices (input and output)]
Classes of Computing Applications and Their Characteristics

Personal computers
- General purpose, variety of software
- Subject to cost/performance tradeoff

Server computers
- Network based
- High capacity, performance, reliability
- Range from small servers to building sized

Supercomputers
- High-end scientific and engineering calculations
- Highest capability but represent a small fraction of the overall computer market

Embedded computers
- Hidden as components of systems
- Stringent power/performance/cost constraints
The PostPC Era

Personal Mobile Devices (PMDs) are small wireless devices that connect to the Internet.
- Battery operated
- Connect to the Internet
- Cost hundreds of dollars
- Smart phones, tablets

Cloud computing refers to large collections of servers that provide services over the Internet.
- Warehouse Scale Computers (WSC)
- Software as a Service (SaaS)
- A portion of the software runs on a PMD and a portion runs in the Cloud
- Amazon and Google

[Figure: units manufactured per year, 2007 to 2012, for cell phones (not including smart phones), smart phones, PCs (not including tablets), and tablets.] The number manufactured per year of tablets and smart phones, which reflect the PostPC era, versus personal computers and traditional cell phones. Smart phones represent the recent growth in the cell phone industry, and they passed PCs in 2011. Tablets are the fastest growing category, nearly doubling between 2011 and 2012. The PC and traditional cell phone categories are relatively flat or declining.
Understanding Performance

Both the software and hardware affect the performance of a program:
- Algorithm
  - Determines the number of operations executed
- Programming language, compiler, architecture
  - Determine the number of machine instructions executed per operation
- Processor and memory system
  - Determine how fast instructions are executed
- I/O system (including OS)
  - Determines how fast I/O operations are executed

[Figure: from a problem to an algorithm, to program code, through the compiler, to the machine]

How fast can you solve a problem on a machine? It depends on:
* The algorithm used
* The HLL program code
* The efficiency of the compiler
Hardware and Software as Hierarchical Layers

- A typical application, such as a word processor or a large database system, may consist of millions of lines of code.
- The hardware in a computer can only execute extremely simple low-level instructions.
- Going from a complex application to the simple instructions involves several layers of software that interpret or translate high-level operations into simple computer instructions. This layering is an example of abstraction.

Application software
- Written in a high-level language
System software
- Compiler: translates HLL code to machine code
- Operating system: service code
  - Handling input/output
  - Managing memory and storage
  - Scheduling tasks & sharing resources
Hardware
- Processor, memory, I/O controllers

Compiler: A program that translates a program written in a high-level language, such as C, C++, Java, or Visual Basic, into instructions that the hardware can execute.

Operating system: Interfaces between a user's program and the hardware and provides a variety of services and supervisory functions.
From a High-Level Language to the Language of Hardware

Instruction: A command that computer hardware understands and obeys.

High-level programming language: A portable language such as C, C++, Java, or Visual Basic that is composed of words and algebraic notation that can be translated by a compiler into assembly language.

Assembler: A program that translates a symbolic version of instructions into the binary version.

Assembly language: A symbolic representation of machine instructions.

Machine language: A binary representation of machine instructions.


High-level language program (in C):

    void swap(int v[], int k)
    {
        int temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }

        (Compiler)

Assembly language program (for MIPS):

    swap:
        multi $2, $5, 4
        add   $2, $4, $2
        lw    $15, 0($2)
        lw    $16, 4($2)
        sw    $16, 0($2)
        sw    $15, 4($2)
        jr    $31

        (Assembler)

Binary machine language program (for MIPS):

    00000000101000100000000100011000
    00000000100000100001000000100001
    10001101111000100000000000000000
    10001110000100100000000000000100
    10101110000100100000000000000000
    10101101111000100000000000000100
    00000011111000000000000000001000
Components of a Computer

Same components for all kinds of computer
- Desktop, server, embedded

Input/output includes
- User-interface devices
  - Display, keyboard, mouse
- Storage devices
  - Hard disk, CD/DVD, flash
- Network adapters
  - For communicating with other computers

Inside the Processor (CPU)

[Figure: processor block diagram, showing the instruction memory (address in, instruction out) and control signals such as MemWrite and RegWrite]

Datapath: performs the arithmetic operations.

Control: tells the datapath, memory, and I/O devices what to do according to the wishes of the instructions of the program.

Cache memory: a small, fast memory that acts as a buffer for a slower, larger memory.
A Safe Place for Data

Volatile main memory
- Loses instructions and data when power is off

Non-volatile secondary memory
- Magnetic disk
- Flash memory
- Optical disk (CD-ROM, DVD)

Networks
- Communication, resource sharing, nonlocal access
- Local area network (LAN): Ethernet
- Wide area network (WAN): the Internet
- Wireless network: WiFi, Bluetooth
Example
For the problems below, use the information about access time for every type of memory in the following table.

          Cache    DRAM     Flash Memory    Magnetic Disk
    a.    5 ns     50 ns    5 us            5 ms
    b.    7 ns     70 ns    15 us           20 ms

* Find how long it takes to read a file from DRAM if it takes 2 microseconds from the cache memory.
* Find how long it takes to read a file from a disk if it takes 2 microseconds from the cache memory.
* Find how long it takes to read a file from a flash memory if it takes 2 microseconds from the cache memory.

For configuration (a):
* From the table, the DRAM time is 10 x the cache time, so the time required to read from DRAM = 10 x 2 microseconds = 20 microseconds.
* From the table, the flash time is 1,000 x the cache time, so the time required to read from flash = 1,000 x 2 microseconds = 2 ms.
* From the table, the magnetic disk time is 1,000,000 x the cache time, so the time required to read from the magnetic disk = 1,000,000 x 2 microseconds = 2 s.

For configuration (b):
* From the table, the DRAM time is 10 x the cache time, so the time required to read from DRAM = 10 x 2 microseconds = 20 microseconds.
* From the table, the flash time is 15 us / 7 ns, about 2,143 x the cache time, so the time required to read from flash = 2,143 x 2 microseconds, about 4.29 ms.
* From the table, the magnetic disk time is 20 ms / 7 ns, about 2,857,143 x the cache time, so the time required to read from the magnetic disk = 2,857,143 x 2 microseconds, about 5.71 s.
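
The scaling used in this example can be checked with a few lines of Python. This is only a sketch: the access times are an assumed reading of the table (cache 5 ns and DRAM 50 ns for configuration a, cache 7 ns and DRAM 70 ns for configuration b), and the names `ACCESS_TIME` and `scaled_read_time` are made up for illustration.

```python
# Access times in seconds; an assumed reading of the table in the example.
ACCESS_TIME = {
    "a": {"cache": 5e-9, "dram": 50e-9, "flash": 5e-6, "disk": 5e-3},
    "b": {"cache": 7e-9, "dram": 70e-9, "flash": 15e-6, "disk": 20e-3},
}

def scaled_read_time(config, memory, cache_read_time=2e-6):
    """Scale the cache read time by (memory access time / cache access time)."""
    times = ACCESS_TIME[config]
    return cache_read_time * times[memory] / times["cache"]

for config in ("a", "b"):
    for memory in ("dram", "flash", "disk"):
        print(config, memory, f"{scaled_read_time(config, memory):.3g} s")
```

Keeping the ratio unrounded avoids the small discrepancies that appear when the ratio is truncated before multiplying.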
The Instruction Execution Cycle

1. Obtain the instruction from program storage
2. Determine required actions and instruction size
3. Locate and obtain operand data
4. Compute result value or status
5. Deposit results in storage for later use
6. Determine successor instruction

[Figure: main memory holding a sequence of instructions, connected to the CPU registers listed below]

- Program counter
- Instruction register
- Memory address register
- Memory buffer register
- Input/output address register
- Input/output buffer register
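
The six steps of the cycle can be illustrated with a toy interpreter. This is a minimal sketch, not a model of any real processor: the opcodes (`LOAD`, `ADD`, `STORE`, `JUMP`, `HALT`), the single accumulator, and the dictionary used as memory are all assumptions made for illustration.

```python
def run(program, memory):
    """Run a toy program; each instruction is a tuple (opcode, operand...)."""
    pc = 0    # program counter: index of the next instruction
    acc = 0   # a single accumulator register (hypothetical machine)
    while pc < len(program):
        instruction = program[pc]          # 1. obtain the instruction
        opcode, *operands = instruction    # 2. determine the required action
        next_pc = pc + 1                   # default successor instruction
        if opcode == "LOAD":               # 3-5. get operand, compute, deposit
            acc = memory[operands[0]]
        elif opcode == "ADD":
            acc += memory[operands[0]]
        elif opcode == "STORE":
            memory[operands[0]] = acc
        elif opcode == "JUMP":
            next_pc = operands[0]
        elif opcode == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        pc = next_pc                       # 6. determine successor instruction
    return memory

memory = {"x": 2, "y": 3, "z": 0}
program = [("LOAD", "x"), ("ADD", "y"), ("STORE", "z"), ("HALT",)]
result = run(program, memory)
print(result["z"])
```

Here the `pc` variable plays the role of the program counter and `instruction` the role of the instruction register.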
Technology Trends
* Electronics technology continues to evolve
  * Increased capacity and performance
  * Reduced cost

    Year    Technology                     Relative performance/cost
    1975    Integrated circuit (IC)
    1995    Very large scale IC (VLSI)     2,400,000
    2013    Ultra large scale IC           250,000,000,000

Moore's Law

* In 1965, Gordon Moore noted that the number of transistors on a chip doubled every 18 to 24 months.
* He predicted that semiconductor technology would double its effectiveness every 18 months.

[Figure: Moore's Law. Transistor count per chip versus year (1975 to 2010) for Intel processors, from the 4004 and 8086 through the i486, Pentium, Pentium Pro, Pentium II, and Pentium III, with a projection to 1 billion transistors. Source: Intel]
Impacts of Advancing Technology

* Processor
  * performance: 2x every 1.5 years
  * ClockCycle = 1/ClockRate
    * 500 MHz ClockRate = 2 nsec ClockCycle
    * 1 GHz ClockRate = 1 nsec ClockCycle
    * 4 GHz ClockRate = 250 psec ClockCycle
* Memory
  * DRAM capacity: 4x every 3 years, now 2x every 2 years
  * memory speed: 1.5x every 10 years
  * cost per bit: decreases about 25% per year
* Disk
  * capacity: increases about 60% per year
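
The relation ClockCycle = 1/ClockRate from the processor items above can be checked directly. The function name `clock_cycle_time` is an illustrative choice, not something defined in the slides.

```python
def clock_cycle_time(clock_rate_hz):
    """Return the clock cycle time in seconds for a clock rate in hertz."""
    return 1.0 / clock_rate_hz

for name, rate_hz in [("500 MHz", 500e6), ("1 GHz", 1e9), ("4 GHz", 4e9)]:
    # Report in picoseconds so all three rates print as whole numbers.
    print(f"{name}: {clock_cycle_time(rate_hz) * 1e12:.0f} ps")
```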


[Tables: evolution of processor characteristics by decade, in four panels: (a) 1970s processors, (b) 1980s processors, (c) 1990s processors, (d) recent processors. The rows include clock speeds, bus width, number of transistors, feature size, addressable memory, virtual memory, and cache. The legible number-of-transistors row for the 1970s processors reads 2,300; 3,500; 6,000; 29,000; 29,000.]
Semiconductor Technology
* Silicon is a semiconductor.
* With a special chemical process, it is possible to add materials to silicon that transform tiny areas into one of three devices:
  * Conductors (using either microscopic copper or aluminum wire)
  * Insulators (like plastic sheathing or glass)
  * Switches (areas that can conduct or insulate under special conditions)
* A VLSI circuit is just billions of combinations of conductors, insulators, and switches manufactured in a single small package.

Manufacturing ICs

Silicon crystal ingot: A rod composed of a silicon crystal that is between 8 and 12 inches in diameter and about 12 to 24 inches long.

Wafer: A slice from a silicon ingot no more than 0.1 inches thick, used to create chips.

Defect: A microscopic flaw in a wafer or in patterning steps that can result in the failure of the die containing that defect.

Die: The individual rectangular sections that are cut from a wafer, more informally known as chips.

Yield: The percentage of good dies from the total number of dies on the wafer.

Bonding: Connecting the good dies to the input/output pins of a package.
[Figure: the chip manufacturing process, from the silicon ingot through 20 to 40 processing steps, wafer testing, dicing, bonding of dies to packages, and final testing, to shipping to customers.]

The chip manufacturing process. After being sliced from the silicon ingot, blank wafers are put through 20 to 40 steps to create patterned wafers. These patterned wafers are then tested with a wafer tester and a map of the good parts is made. Then, the wafers are diced into dies. The good dies are then bonded into packages and tested one more time before shipping the packaged parts to customers.
Intel Core i7 Wafer
- 12 inch (300 mm)
- 280 chips
- 217 mm^2 die area
- 32 nm technology
- 731,000,000 transistors

Dr. Ahmed Jaber Spring 2019 







Intel Pentium 4 Wafer
- 8 inch (200 mm)
- 165 chips
- 250 mm^2 die area
- 180 nm technology
- 55,000,000 transistors

[Figure: inside the processor chip. The left-hand side is a microphotograph of the Pentium 4 processor chip, and the right-hand side shows the major blocks in the processor: control, I/O interface, instruction cache, data cache, enhanced floating point and multimedia, integer datapath, secondary cache and memory interface, advanced pipelining, and hyperthreading support.]
Integrated Circuit Cost

The cost of an integrated circuit can be expressed in the following equations:

$$\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$$

$$\text{Dies per wafer} \approx \frac{\text{Wafer area}}{\text{Die area}}$$

$$\text{Yield} = \frac{1}{\left(1 + (\text{Defects per area} \times \text{Die area}/2)\right)^{2}}$$

Note

The number of dies per wafer is approximately the area of the wafer divided by the area of the die. It can be more accurately estimated by

$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diameter}/2)^{2}}{\text{Die area}} - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}$$

Example: Find the number of dies per 300 mm (30 cm) wafer for a die that is 1.5 cm on a side and for a die that is 1.0 cm on a side.

Answer: When the die area is 2.25 cm^2:

$$\text{Dies per wafer} = \frac{\pi \times (30/2)^{2}}{2.25} - \frac{\pi \times 30}{\sqrt{2 \times 2.25}} = \frac{706.9}{2.25} - \frac{94.2}{2.12} = 270$$

When the die area is 1.00 cm^2:

$$\text{Dies per wafer} = \frac{\pi \times (30/2)^{2}}{1.00} - \frac{\pi \times 30}{\sqrt{2 \times 1.00}} = \frac{706.9}{1.00} - \frac{94.2}{1.41} = 640$$
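
The cost equations can be tried out numerically with a short Python sketch. The function names are illustrative, and the wafer cost and defect density in the last lines are made-up inputs, not values from the slides.

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """More accurate estimate: wafer area / die area minus an edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss

def die_yield(defects_per_cm2, die_area_cm2):
    """Yield = 1 / (1 + defects per area * die area / 2)^2."""
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2 / 2) ** 2

def cost_per_die(cost_per_wafer, wafer_diameter_cm, die_area_cm2, defects_per_cm2):
    """Cost per die = cost per wafer / (dies per wafer * yield)."""
    dies = dies_per_wafer(wafer_diameter_cm, die_area_cm2)
    return cost_per_wafer / (dies * die_yield(defects_per_cm2, die_area_cm2))

print(round(dies_per_wafer(30, 2.25)), round(dies_per_wafer(30, 1.0)))
# Hypothetical inputs: a $5000 wafer and 0.03 defects per cm^2.
print(f"{cost_per_die(5000, 30, 2.25, 0.03):.2f}")
```

Because both the die count and the yield drop as the die area grows, the cost per die rises faster than linearly with die area.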
What is Performance?

    Airplane      Passengers    Flight time    Speed       Throughput (pmph)
    Boeing 747    470           6.5 hours      610 mph     286,700
    Concorde      132           3 hours        1350 mph    178,200

Throughput is measured in passenger-miles per hour (pmph = passengers x mph).

Which of the planes has better performance?

* The plane with the highest speed is the Concorde.
* The plane with the largest capacity is the Boeing 747.
* Flying time of the Concorde vs. the Boeing 747: the Concorde is 1350 mph / 610 mph = 2.2 times faster.
* Throughput of the Concorde vs. the Boeing 747: the Boeing is 286,700 pmph / 178,200 pmph = 1.6 times faster.
* So the Boeing is 1.6 times faster in terms of throughput, and the Concorde is 2.2 times faster in terms of flying time.
* When discussing processor performance, we will focus primarily on execution time for a single job. Why?
Response Time and Throughput

1- Response time (execution time): The total time required for the computer to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead, CPU execution time, and so on.

2- Throughput (bandwidth): The number of tasks completed per unit time.

Throughput and Response Time
Do the following changes to a computer system increase throughput, decrease response time, or both?

1. Replacing the processor in a computer with a faster version
2. Adding additional processors to a system that uses multiple processors for separate tasks, for example, searching the web

Decreasing response time almost always improves throughput. Hence, in case 1, both response time and throughput are improved. In case 2, no one task gets work done faster, so only throughput increases.

Note: In many real computer systems, changing either execution time or throughput often affects the other.

To maximize performance, we want to minimize response time or execution time for some task. Thus, we can relate performance and execution time for a computer X:

$$\text{Performance}_X = \frac{1}{\text{Execution time}_X}$$

This means that for two computers X and Y, if the performance of X is greater than the performance of Y, we have

$$\text{Performance}_X > \text{Performance}_Y$$

$$\frac{1}{\text{Execution time}_X} > \frac{1}{\text{Execution time}_Y}$$

$$\text{Execution time}_Y > \text{Execution time}_X$$

If X is n times as fast as Y, then the execution time on Y is n times as long as it is on X:

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$
Example

If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B?

We know that A is n times as fast as B if

$$\frac{\text{Performance}_A}{\text{Performance}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A} = n$$

Thus the performance ratio is

$$\frac{15}{10} = 1.5$$

and A is therefore 1.5 times as fast as B.
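
The same ratio can be written as a one-line helper. The name `speedup` is an illustrative choice; the function simply divides the two execution times in the order given by the performance equation.

```python
def speedup(execution_time_x, execution_time_y):
    """Return n such that X is n times as fast as Y."""
    return execution_time_y / execution_time_x

n = speedup(execution_time_x=10.0, execution_time_y=15.0)
print(n)
```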


Who Affects Performance? 


* programmer 

* compiler 

* instruction-set architect 

* machine architect 

* hardware designer 

* materials scientist/physicist/silicon engineer 
Measuring Performance

1- CPU execution time (CPU time)

CPU time (or CPU execution time) is the time the CPU spends computing for a given program, including operating system routines executed on the program's behalf. It does not include time spent waiting for I/O or running other programs. CPU time comprises user CPU time and system CPU time:

- User CPU time: the CPU time spent in the program itself.
- System CPU time: the CPU time spent in the operating system performing tasks on behalf of the program.

2- Elapsed time

Total response time, including all aspects: processing, I/O, OS overhead, idle time.

3- CPU Clocking

[Figure: clock waveform. Each clock period consists of data transfer and computation, followed by a state update.]

Clock period: duration of a clock cycle
- e.g., 250 ps = 0.25 ns = 250 x 10^-12 s

Clock frequency (rate): cycles per second
- e.g., 4.0 GHz = 4000 MHz = 4.0 x 10^9 Hz
- 1 GHz = 10^9 cycles/s (cycle time = 10^-9 s = 1 ns)
- 200 MHz = 200 x 10^6 cycles/s (cycle time = 5 ns)
Faster Clock ≠ Shorter Running Time

[Figure: the same solution reached in 4 steps versus 20 steps]

In this example, addition time does not improve in going from a 1 GHz to a 2 GHz clock.

$$\text{CPU time (s)} = \frac{\text{CPU clock cycles (cycles)}}{\text{Clock rate (cycles/s)}}$$

- Performance is improved by
  - Reducing the number of clock cycles
  - Increasing the clock rate
- The hardware designer must often trade off clock rate against cycle count
Example
Computer A: 2 GHz clock, 10 s CPU time
Designing Computer B
- Aim for 6 s CPU time
- Can do a faster clock, but this causes 1.2 x as many clock cycles
How fast must the Computer B clock be?

Answer

The number of clock cycles required for the program on A:

$$\text{CPU time}_A = \frac{\text{CPU clock cycles}_A}{\text{Clock rate}_A}$$

$$10 \text{ seconds} = \frac{\text{CPU clock cycles}_A}{2 \times 10^{9}\ \text{cycles/second}}$$

$$\text{CPU clock cycles}_A = 10 \text{ seconds} \times 2 \times 10^{9}\ \frac{\text{cycles}}{\text{second}} = 20 \times 10^{9} \text{ cycles}$$

CPU time for B can be found using this equation:

$$\text{CPU time}_B = \frac{1.2 \times \text{CPU clock cycles}_A}{\text{Clock rate}_B}$$

$$6 \text{ seconds} = \frac{1.2 \times 20 \times 10^{9} \text{ cycles}}{\text{Clock rate}_B}$$

$$\text{Clock rate}_B = \frac{1.2 \times 20 \times 10^{9} \text{ cycles}}{6 \text{ seconds}} = \frac{0.2 \times 20 \times 10^{9} \text{ cycles}}{\text{second}} = \frac{4 \times 10^{9} \text{ cycles}}{\text{second}} = 4 \text{ GHz}$$

To run the program in 6 seconds, B must have twice the clock rate of A.
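
The two steps of this answer (find the cycle count on A, then solve for B's clock rate) can be sketched in Python as follows. The helper names are illustrative only.

```python
def clock_cycles(cpu_time_s, clock_rate_hz):
    """CPU clock cycles = CPU time x clock rate."""
    return cpu_time_s * clock_rate_hz

def required_clock_rate(cycles, target_time_s):
    """Clock rate needed to execute `cycles` cycles in `target_time_s` seconds."""
    return cycles / target_time_s

cycles_a = clock_cycles(cpu_time_s=10.0, clock_rate_hz=2e9)
cycles_b = 1.2 * cycles_a            # B needs 1.2 x as many cycles as A
rate_b = required_clock_rate(cycles_b, target_time_s=6.0)
print(f"{rate_b / 1e9:.1f} GHz")
```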
Instruction Performance

- The performance equations above did not include any reference to the number of instructions needed for the program.
- Execution time equals the number of instructions executed multiplied by the average time per instruction.
- Clock Cycles Per Instruction (CPI): Average number of clock cycles per instruction for a program.
- Instruction Count (IC): The number of instructions executed by the program.
- Therefore, the number of clock cycles required for a program can be written as

$$\text{CPU clock cycles} = \text{Instructions for a program} \times \text{Average clock cycles per instruction}$$

$$\text{Clock Cycles} = \text{Instruction Count} \times \text{CPI}$$

$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

- Instruction count for a program
  - Determined by program, ISA, and compiler
- Average cycles per instruction
  - Determined by CPU hardware
  - If different instructions have different CPI, the average CPI is affected by the instruction mix

Note: The three key factors (instruction count, CPI, and clock cycle time) affect the performance. We can use these formulas to compare two different implementations or to evaluate a design alternative if we know its impact on these three parameters.
Example

Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same program. Which computer is faster for this program and by how much?

Answer

We know that each computer executes the same number of instructions for the program; let's call this number I. First, find the number of processor clock cycles for each computer:

$$\text{CPU clock cycles}_A = I \times 2.0$$

$$\text{CPU clock cycles}_B = I \times 1.2$$

Now we can compute the CPU time for each computer:

$$\text{CPU time}_A = \text{CPU clock cycles}_A \times \text{Clock cycle time}_A = I \times 2.0 \times 250 \text{ ps} = 500 \times I \text{ ps}$$

Likewise, for B:

$$\text{CPU time}_B = I \times 1.2 \times 500 \text{ ps} = 600 \times I \text{ ps}$$

Clearly, computer A is faster. The amount faster is given by the ratio of the execution times:

$$\frac{\text{CPU performance}_A}{\text{CPU performance}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A} = \frac{600 \times I \text{ ps}}{500 \times I \text{ ps}} = 1.2$$

We can conclude that computer A is 1.2 times as fast as computer B for this program.
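
A small Python sketch of the same comparison is shown below. The instruction count is an arbitrary placeholder, since it cancels in the ratio; the function name `cpu_time` is illustrative.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time_s):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time_s

I = 1_000_000  # hypothetical instruction count; any positive value gives the same ratio
time_a = cpu_time(I, cpi=2.0, clock_cycle_time_s=250e-12)
time_b = cpu_time(I, cpi=1.2, clock_cycle_time_s=500e-12)
ratio = time_b / time_a   # how many times as fast A is compared with B
print(f"A is {ratio:.2f} times as fast as B")
```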
If different instruction classes take different numbers of cycles

$$\text{Clock Cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{Instruction Count}_i)$$

Weighted average CPI

$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$$

- The clock rate (frequency) f is given by f = 1/τ, where τ is the clock cycle time. For example, τ = 100 ns gives f = 1/(100 ns) = 10 MHz.
- The size of a program is determined by its instruction count IC.
- CPI_i (cycles per instruction) represents the number of CPU cycles required by instruction i to execute.
- The average CPI of a program P:

$$\text{CPI} = \sum_{i \in \text{Classes}} \text{CPI}_i \times \frac{\text{IC}_i}{\text{IC}} = \sum_{i \in \text{Classes}} \text{CPI}_i \times \text{Freq}_i$$

Ex: consider the following program:

    ADD A,B    CPI = 3
    MUL C,D    CPI = 4
    ADD A,C

* The effective CPI is CPI = 3 x 2/3 + 4 x 1/3 = 3.33
Example

Alternative compiled code sequences using instructions in classes A, B, and C:

    Class               A    B    C
    CPI for class       1    2    3
    IC in sequence 1    2    1    2
    IC in sequence 2    4    1    1

Which code sequence executes the most instructions? Which will be faster? What is the CPI for each sequence?

Answer

Sequence 1 executes 2 + 1 + 2 = 5 instructions.

Sequence 2 executes 4 + 1 + 1 = 6 instructions.

The total number of clock cycles for each sequence is

$$\text{CPU clock cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times C_i)$$

This yields

$$\text{CPU clock cycles}_1 = (2 \times 1) + (1 \times 2) + (2 \times 3) = 2 + 2 + 6 = 10 \text{ cycles}$$

$$\text{CPU clock cycles}_2 = (4 \times 1) + (1 \times 2) + (1 \times 3) = 4 + 2 + 3 = 9 \text{ cycles}$$

So code sequence 2 is faster, even though it executes one extra instruction. Since code sequence 2 takes fewer overall clock cycles but has more instructions, it must have a lower CPI. The CPI values can be computed by

$$\text{CPI} = \frac{\text{CPU clock cycles}}{\text{Instruction count}}$$

$$\text{CPI}_1 = \frac{\text{CPU clock cycles}_1}{\text{Instruction count}_1} = \frac{10}{5} = 2.0$$

$$\text{CPI}_2 = \frac{\text{CPU clock cycles}_2}{\text{Instruction count}_2} = \frac{9}{6} = 1.5$$
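
The per-class sums in this answer can be sketched in Python. The class CPIs (1, 2, 3) and the instruction counts are the ones used in the answer above; the dictionary layout and function names are illustrative choices.

```python
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}

def total_cycles(instruction_counts):
    """Sum of CPI_i x IC_i over the instruction classes."""
    return sum(CPI_BY_CLASS[cls] * n for cls, n in instruction_counts.items())

def average_cpi(instruction_counts):
    """Weighted average CPI = total clock cycles / total instruction count."""
    return total_cycles(instruction_counts) / sum(instruction_counts.values())

sequence_1 = {"A": 2, "B": 1, "C": 2}
sequence_2 = {"A": 4, "B": 1, "C": 1}
for name, seq in [("sequence 1", sequence_1), ("sequence 2", sequence_2)]:
    print(name, total_cycles(seq), "cycles, CPI =", average_cpi(seq))
```

The sequence with more instructions can still win when its mix leans toward the cheaper instruction classes.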




$$\text{CPU Time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

The execution time depends on three factors:

1. How many instructions must execute to complete a program? (Instructions per program)
2. How many cycles does each instruction take to execute? Cycles per instruction (CPI) or, as its reciprocal, instructions per cycle (IPC)
3. How quickly does the processor cycle? Clock frequency (GHz, cycles per second) or, as its reciprocal, clock period (ns, seconds per cycle)

Exec. time = (Instr./program) * (Cycles/Instr.) * (sec./Cycle)

For minimum execution time, minimize each term.

    Components of performance             Units of measure
    CPU execution time for a program      Seconds for the program
    Instruction count                     Instructions executed for the program
    Clock cycles per instruction (CPI)    Average number of clock cycles per instruction
    Clock cycle time                      Seconds per clock cycle

Performance depends on
- Algorithm: affects IC, possibly CPI
- Programming language: affects IC, CPI
- Compiler: affects IC, CPI
- Instruction set architecture: affects IC, CPI, Tc




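Million Instructions Per Second (MIPS): A measurement of program execution speed based on the number of millions of instructions. MIPS is computed as the instruction count divided by the product of the execution time and 10^6.

$$\text{MIPS rate} = \frac{IC}{T \times 10^{6}} = \frac{f}{\text{CPI} \times 10^{6}}$$

where IC is the instruction count, T is the execution time, and f is the clock rate.

Example

The following mix of instructions is executed on a 40-MHz processor:

    Instruction type      Instruction count    Cycles per instruction
    Integer arithmetic    45,000               1
    Data transfer         32,000               2
    Floating point        15,000               2
    Control transfer       8,000               2

Calculate the effective CPI, the MIPS rate, and the execution time.

Solution

$$\text{CPI} = \sum_{i \in \text{Classes}} \text{CPI}_i \times \frac{IC_i}{IC} = \frac{1 \times 45 \times 10^{3} + 2 \times 32 \times 10^{3} + 2 \times 15 \times 10^{3} + 2 \times 8 \times 10^{3}}{100000} = 1.55$$

$$\text{MIPS rate} = \frac{f}{\text{CPI} \times 10^{6}} = \frac{40 \times 10^{6}}{1.55 \times 10^{6}} = 25.8$$

$$T = \frac{1}{f} \sum_{i \in \text{Classes}} IC_i \times \text{CPI}_i = \frac{1 \times 45000 + 2 \times 32000 + 2 \times 15000 + 2 \times 8000}{40 \times 10^{6}} = 3.875 \text{ ms}$$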
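
The three quantities in this solution can be recomputed with a short Python sketch. The instruction counts (45,000; 32,000; 15,000; 8,000) and CPIs (1, 2, 2, 2) are read from the solution's arithmetic; the variable names are illustrative.

```python
CLOCK_RATE_HZ = 40e6

# (instruction type, instruction count, cycles per instruction)
MIX = [
    ("integer arithmetic", 45_000, 1),
    ("data transfer", 32_000, 2),
    ("floating point", 15_000, 2),
    ("control transfer", 8_000, 2),
]

instruction_count = sum(count for _name, count, _cpi in MIX)
cycles = sum(count * cpi for _name, count, cpi in MIX)

effective_cpi = cycles / instruction_count
mips_rate = CLOCK_RATE_HZ / (effective_cpi * 1e6)
execution_time_s = cycles / CLOCK_RATE_HZ

print(f"CPI = {effective_cpi:.2f}, MIPS = {mips_rate:.1f}, "
      f"T = {execution_time_s * 1e3:.3f} ms")
```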
(Exercise 1.3): Consider three different processors P1, P2, and P3 executing the same instruction set with the clock rates and CPIs given in the following table.

    Processor    Clock rate    CPI
    P1           3 GHz         1.5
    P2           2.5 GHz       1.0
    P3           4 GHz         2.2

a- Which processor has the highest performance expressed in instructions per second?
b- If the processors each execute a program in 10 seconds, find the number of cycles and the number of instructions.
c- We are trying to reduce the time by 30%, but this leads to an increase of 20% in the CPI. What clock rate should we have to get this time reduction?

a.

The performance of each processor is calculated by using the following formula:

$$\text{Performance}(P) = \frac{\text{Clock rate}}{\text{CPI}} \text{ instructions per second}$$

For processor P1:

$$\text{Performance}(P_1) = \frac{3 \times 10^{9}}{1.5} = 2 \times 10^{9} \text{ instructions per second}$$

For processor P2:

$$\text{Performance}(P_2) = \frac{2.5 \times 10^{9}}{1.0} = 2.5 \times 10^{9} \text{ instructions per second}$$

For processor P3:

$$\text{Performance}(P_3) = \frac{4 \times 10^{9}}{2.2} = 1.82 \times 10^{9} \text{ instructions per second}$$

Comparing the three results, P2 executes the most instructions per second. Thus, processor P2 has the highest performance expressed in instructions per second.
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b. 
Consider the CPU time for executing each program is 10 seconds. 


The number of cycles and number of instructions for each processor is calculated by using the following formulae: 


Number of cycles( P) = Timex Clock Rate 


Number of cycles 


Number of instructions( P ) E 


IHRSIFUCIIOHS 





For processor P1: 
u—— —— À—— -%) 


Number of cycles ( В ) = Timex Clock Rate 
-І0х3х10” 
—30x10* 


Thus, the number of cycles for processor A is [303 1(? |- 


Number of cycles 


Number of instructions( P) = CPI 


Instruciioms 


— 30% 10° 
1.5 
= 20 10° 


Thus, the number of cycles for processor P is . 


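For processor P2:

\[ \text{Number of cycles}(P_2) = \text{Time} \times \text{Clock Rate} = 10 \times 2.5 \times 10^{9} = 25 \times 10^{9} \]

Thus, the number of cycles for processor P2 is 25 × 10^9.

\[ \text{Number of instructions}(P_2) = \frac{\text{Number of cycles}}{\text{CPI}} = \frac{25 \times 10^{9}}{1.0} = 25 \times 10^{9} \]

Thus, the number of instructions for processor P2 is 25 × 10^9.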


For processor P3:

\[ \text{Number of cycles}(P_3) = \text{Time} \times \text{Clock Rate} = 10 \times 4 \times 10^{9} = 40 \times 10^{9} \]

Thus, the number of cycles for processor P3 is 40 × 10^9.

\[ \text{Number of instructions}(P_3) = \frac{\text{Number of cycles}}{\text{CPI}} = \frac{40 \times 10^{9}}{2.2} \approx 18.18 \times 10^{9} \]

Thus, the number of instructions for processor P3 is about 18.18 × 10^9.
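As a cross-check of part b, the two formulae can be evaluated directly. The sketch below assumes the 10-second CPU time and the clock rates and CPI values used in the solution; `cycles`, `instruction_count` and `results` are illustrative names.

```python
TIME_S = 10  # CPU time assumed in part b, in seconds

# Clock rate (Hz) and CPI per processor, as used in the worked solution.
PROCESSORS = {"P1": (3.0e9, 1.5), "P2": (2.5e9, 1.0), "P3": (4.0e9, 2.2)}


def cycles(time_s, clock_rate_hz):
    """Number of cycles = time x clock rate."""
    return time_s * clock_rate_hz


def instruction_count(n_cycles, cpi):
    """Number of instructions = number of cycles / CPI."""
    return n_cycles / cpi


results = {}
for name, (rate, cpi) in PROCESSORS.items():
    n_cycles = cycles(TIME_S, rate)
    results[name] = (n_cycles, instruction_count(n_cycles, cpi))
    print(f"{name}: {n_cycles:.4g} cycles, {results[name][1]:.4g} instructions")
```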
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For processor P1: 


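\[ \text{CPI}_{new} = 1.2 \times \text{CPI}_{old} = 1.2 \times 1.5 = 1.8 \]

\[ \text{Number of cycles}(P_1) = \text{Time} \times \text{Clock Rate} = 10 \times 3 \times 10^{9} = 30 \times 10^{9} \]

Thus, the number of cycles for processor P1 is 30 × 10^9.

\[ \text{Number of instructions}(P_1) = \frac{\text{Number of cycles}}{\text{CPI}_{old}} = \frac{30 \times 10^{9}}{1.5} = 20 \times 10^{9} \]

Thus, the number of instructions for processor P1 is 20 × 10^9.

With the new CPU time of 7 s (30% less than 10 s):

\[ \text{Clock rate} = \frac{\text{Number of instructions} \times \text{CPI}_{new}}{\text{Time}_{new}} = \frac{20 \times 10^{9} \times 1.8}{7} = \frac{36 \times 10^{9}}{7} \approx 5.14 \ \text{GHz} \]

Thus, the clock rate for processor P1 is 5.14 GHz.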
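The same calculation can be scripted. This is a sketch under the assumptions of the exercise (old CPU time 10 s, time reduced by 30%, CPI increased by 20%); `required_clock_rate` is a hypothetical helper name.

```python
def required_clock_rate(instructions, cpi, time_s):
    """Clock rate = (number of instructions x CPI) / time."""
    return instructions * cpi / time_s


old_time_s = 10.0
old_cpi = 1.5
instructions = 20e9              # P1 instruction count from part b

new_time_s = 0.7 * old_time_s    # 30% reduction in execution time
new_cpi = 1.2 * old_cpi          # 20% increase in CPI

clock_rate_hz = required_clock_rate(instructions, new_cpi, new_time_s)
print(f"required clock rate: {clock_rate_hz / 1e9:.2f} GHz")
```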


4. 














Consider the old CPU time to be 10 seconds.

Now, calculate the new CPU time as follows:

\[ \text{CPU Time} = \frac{\text{Number of instructions} \times \text{CPI}}{\text{Clock rate}} \]

The time is decreased by 30%:

\[ \text{CPU Time}_{new} = \frac{70 \times 10}{100} = 0.7 \times 10 = 7 \ \text{s} \]

So, the new CPU time is 7 s.

The CPI is increased by 20%:

\[ \text{CPI}_{new} = \frac{120 \times \text{CPI}_{old}}{100} = 1.2 \times \text{CPI}_{old} \]

So, CPI_new = 1.2 × CPI_old.

Calculate the clock rate needed to get the time reduction by using the following formula:

\[ \text{Clock rate} = \frac{\text{Number of instructions} \times \text{CPI}_{new}}{\text{Time}_{new}} \]

Calculate the number of cycles and the number of instructions of each processor by using the following formulae:

\[ \text{Number of cycles}(P) = \text{Time} \times \text{Clock Rate} \]

\[ \text{Number of instructions}(P) = \frac{\text{Number of cycles}}{\text{CPI}} \]
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Example 

Consider two different machines, with two different instruction sets, both of which 
have a clock rate of 200 MHz. The following measurements are recorded on the two 
machines running a given set of benchmark programs: 


Instruction Type          Instruction Count (millions)   Cycles per Instruction

Machine A
  Arithmetic and logic    8                              1
  Load and store          4                              3
  Branch                  2                              4
  Others                  4                              3

Machine B
  Arithmetic and logic    10                             1
  Load and store          8                              2
  Branch                  2                              4
  Others                  4                              3

a. Determine the effective CPI, MIPS rate, and execution time for each machine.
b. Comment on the results.





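a.

\[ \text{CPI}_A = \frac{\sum \text{CPI}_i \times I_i}{I_c} = \frac{(8 \times 1 + 4 \times 3 + 2 \times 4 + 4 \times 3) \times 10^{6}}{(8 + 4 + 2 + 4) \times 10^{6}} \approx 2.22 \]

\[ \text{MIPS}_A = \frac{f}{\text{CPI}_A \times 10^{6}} = \frac{200 \times 10^{6}}{2.22 \times 10^{6}} = 90 \]

\[ \text{CPU}_A = \frac{I_c \times \text{CPI}_A}{f} = \frac{18 \times 10^{6} \times 2.22}{200 \times 10^{6}} = 0.2 \ \text{s} \]

\[ \text{CPI}_B = \frac{\sum \text{CPI}_i \times I_i}{I_c} = \frac{(10 \times 1 + 8 \times 2 + 2 \times 4 + 4 \times 3) \times 10^{6}}{(10 + 8 + 2 + 4) \times 10^{6}} \approx 1.92 \]

\[ \text{MIPS}_B = \frac{f}{\text{CPI}_B \times 10^{6}} = \frac{200 \times 10^{6}}{1.92 \times 10^{6}} \approx 104 \]

\[ \text{CPU}_B = \frac{I_c \times \text{CPI}_B}{f} = \frac{24 \times 10^{6} \times 1.92}{200 \times 10^{6}} = 0.23 \ \text{s} \]

b. Although machine B has a higher MIPS rate than machine A, it requires a longer CPU time to execute the same set of benchmark programs.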
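A short script makes the point of part b concrete. It is a sketch only: the instruction mixes are the counts and CPI values that appear in the solution's numerators and denominators, and `effective_cpi`, `mips_rate` and `cpu_time_s` are illustrative helper names.

```python
CLOCK_MHZ = 200  # both machines run at 200 MHz

# (instruction count in millions, cycles per instruction) for each instruction type
machine_a = [(8, 1), (4, 3), (2, 4), (4, 3)]
machine_b = [(10, 1), (8, 2), (2, 4), (4, 3)]


def effective_cpi(mix):
    """Weighted average CPI over the instruction mix."""
    total_instructions = sum(count for count, _ in mix)
    total_cycles = sum(count * cpi for count, cpi in mix)
    return total_cycles / total_instructions


def mips_rate(clock_mhz, cpi):
    """MIPS = f / (CPI x 10^6), with f given in MHz."""
    return clock_mhz / cpi


def cpu_time_s(mix, clock_mhz):
    """CPU time = total cycles / clock rate (millions of cycles / MHz = seconds)."""
    return sum(count * cpi for count, cpi in mix) / clock_mhz


for name, mix in (("A", machine_a), ("B", machine_b)):
    cpi = effective_cpi(mix)
    print(f"Machine {name}: CPI={cpi:.2f}, MIPS={mips_rate(CLOCK_MHZ, cpi):.0f}, "
          f"time={cpu_time_s(mix, CLOCK_MHZ):.2f} s")
```

Machine B wins on MIPS but loses on execution time because it executes more instructions for the same benchmark set.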
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Power Trends 










[Figure: clock rate (MHz) and power (W) of successive Intel x86 microprocessor generations; the legible labels include Pentium (1993), Pentium Pro (1997), Pentium 4 Willamette (2001), Pentium 4 Prescott (2004), Core 2 Kentsfield (2007), Core i5 Clarkdale (2010) and Core i5 Ivy Bridge (2012).]
In CMOS IC technology

\[ \text{Power} = \text{Capacitive load} \times \text{Voltage}^{2} \times \text{Frequency} \]

For CMOS chips, the traditionally dominant energy consumption has been in switching transistors, called dynamic power.

* Because leakage current flows even when a transistor is off, static power is now important too:

\[ \text{Power}_{static} = \text{Current}_{static} \times \text{Voltage} \]

* Leakage current increases in processors with smaller transistor sizes
* Increasing the number of transistors increases power even if they are turned off
Reducing Power

Suppose a new CPU has
* 85% of the capacitive load of the old CPU
* 15% voltage and 15% frequency reduction

\[ \frac{P_{new}}{P_{old}} = \frac{C_{old} \times 0.85 \times (V_{old} \times 0.85)^{2} \times F_{old} \times 0.85}{C_{old} \times V_{old}^{2} \times F_{old}} = 0.85^{4} = 0.52 \]

The power wall
* We can't reduce voltage further
* We can't remove more heat

How else can we improve performance?
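The 0.52 ratio follows directly from the dynamic-power formula. A minimal sketch, with `dynamic_power` as an illustrative helper and the old CPU normalized to 1.0 in every factor (an assumption made only to form the ratio):

```python
def dynamic_power(capacitive_load, voltage, frequency):
    """Dynamic power = capacitive load x voltage^2 x frequency."""
    return capacitive_load * voltage ** 2 * frequency


# Old CPU normalized to 1.0; new CPU has 85% of the capacitive load,
# 15% lower voltage and 15% lower frequency.
p_old = dynamic_power(1.0, 1.0, 1.0)
p_new = dynamic_power(0.85, 0.85, 0.85)

ratio = p_new / p_old
print(f"P_new / P_old = {ratio:.2f}")
```

Voltage enters squared, which is why lowering it was historically the most effective way to cut power.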
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Exercise 1.8 


Suppose we have developed new versions of a processor with the following characteristics.

[Table: Version, Voltage (table entries not recoverable)]

1.8.1 How much has the capacitive load varied between versions if the dynamic power has been reduced by 10%?
1.8.2 How much has the dynamic power been reduced if the capacitive load does not change?
1.8.3 Assuming that the capacitive load of version 2 is 80% of the capacitive load of version 1, find the voltage for version 2 if the dynamic power of version 2 is reduced by 40% from version 1.































[Handwritten worked solution to Exercise 1.8. Legible fragments: in 1.8.2 the capacitive load is unchanged, so C2 = C1; in 1.8.3, C2 = 0.8 C1 and P2 = 0.6 P1.]
Spring 2019 
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Amdahl’s Law 


While manufacturers put enormous effort into improving the performance of processors, memory and input/output devices such as storage are still slow compared with processors. This means that the overall speed improvement is limited by the low speed of those devices. In other words, at a certain point, manufacturers will need to pay more attention to improving I/O speed than processor speed.

What is Amdahl's Law?

Amdahl's law is an expression used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.

Bounds on Speedup

[Figure: (a) on one processor, the execution time t_s consists of a serial section f·t_s and parallelizable sections (1 - f)·t_s; (b) on p processors, the parallelizable part shrinks to (1 - f)·t_s / p, so the parallel time is t_p = f·t_s + (1 - f)·t_s / p.]
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Single Enhancement 
F: Fraction enhanced, S: Speedup enhanced

[Figure: execution time without the enhancement E compared with execution time with E.]

\[ \text{Speedup} = \frac{1}{(1 - F) + \frac{F}{S}} \]

* Speedup: the maximum possible improvement of the system.
* F (fraction): the part that can be improved. In other words, (1 - F) is the part of the system that cannot be improved.
* S (factor of improvement): the performance improvement factor of F after applying the enhancements.
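The formula translates into a one-line function. The sketch below uses the illustrative name `amdahl_speedup`; the sample values (F = 0.9 with S = 10, and F = 0.4 with S = 2.3) are chosen only to exercise it.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


print(f"{amdahl_speedup(0.9, 10):.2f}")    # 90% of the program made 10x faster
print(f"{amdahl_speedup(0.4, 2.3):.3f}")   # 40% of the program made 2.3x faster

# Even an unbounded improvement of the enhanced part is capped at 1 / (1 - F).
print(f"{amdahl_speedup(0.9, 1e12):.2f}")
```

The last call shows the bound: with F = 0.9 the overall speedup can never exceed 10, no matter how large S becomes.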
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Execution time after improvement = 
(Execution time affected by improvement / Amount of improvement) + Execution time unaffected

Speedup(E) is measured relative to how the system performed previously:

\[ \text{Speedup}(E) = \frac{\text{Performance with } E}{\text{Performance before}} = \frac{\text{ExTime before}}{\text{ExTime with } E} \]

An enhancement improves a fraction f of the execution time by a factor s, and the remaining time is unaffected:

\[ \text{Speedup}(E) = \frac{1}{\frac{f}{s} + (1 - f)} \]


Dr. Ahmed Jaber Spring 2019 


Computer Performance
"X is N% faster than Y."

\[ \frac{\text{Execution Time of } Y}{\text{Execution Time of } X} = 1 + \frac{N}{100} \]

Using Amdahl's law
Overall speedup if we make 90% of a program run 10 times faster.
F = 0.9, S = 10

\[ \text{Overall Speedup} = \frac{1}{(1 - 0.9) + \frac{0.9}{10}} = \frac{1}{0.1 + 0.09} = 5.26 \]

Overall speedup if we make 80% of a program run 20% faster.
F = 0.8, S = 1.2

\[ \text{Overall Speedup} = \frac{1}{(1 - 0.8) + \frac{0.8}{1.2}} = \frac{1}{0.2 + 0.667} \approx 1.15 \]


Example
Let a program have 40 percent of its code enhanced (so F = 0.4) to run 2.3 times faster (so S = 2.3). What is the overall system speedup?
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Solution 


Step 1: Set up the equation:

\[ \text{Speedup} = \frac{1}{(1 - F) + \frac{F}{S}} \]

Step 2: Plug in values and solve:

\[ \text{Speedup} = \frac{1}{(1 - 0.4) + \frac{0.4}{2.3}} = \frac{1}{0.6 + 0.174} = \frac{1}{0.774} = 1.292 \]

Example

Suppose a program runs in 100 seconds on a computer, with multiply operations responsible for 80 seconds of this time.
a- How much do I have to improve the speed of multiplication if I want my program to run four times faster?
b- How much do I have to improve the speed of multiplication if I want my program to run five times faster?

Answer

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

\[ \frac{100}{4} = \frac{80}{S} + (100 - 80) \]

\[ 25 = \frac{80}{S} + 20 \quad\Rightarrow\quad 5 = \frac{80}{S} \quad\Rightarrow\quad S = 16 \]
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Execution time after improvement = 
(Execution time affected by improvement / Amount of improvement) + Execution time unaffected

For this problem:

\[ \text{Execution time after improvement} = \frac{80 \ \text{seconds}}{S} + (100 - 80 \ \text{seconds}) \]

Since we want the performance to be five times faster, the new execution time should be 20 seconds, giving

\[ 20 \ \text{seconds} = \frac{80 \ \text{seconds}}{S} + 20 \ \text{seconds} \quad\Rightarrow\quad 0 = \frac{80 \ \text{seconds}}{S} \]

That is, there is no amount by which we can enhance multiply to achieve a fivefold increase in performance, if multiply accounts for only 80% of the workload.
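Both parts of the multiply example can be solved with one helper that rearranges the execution-time equation for the improvement factor. This is a sketch; `required_factor` is a hypothetical name, and it returns `None` when no finite improvement can reach the target.

```python
def required_factor(total_s, affected_s, target_speedup):
    """Improvement factor needed on the affected part, or None if unreachable."""
    target_time = total_s / target_speedup
    unaffected = total_s - affected_s
    remaining = target_time - unaffected  # time budget left for the affected part
    if remaining <= 0:
        return None                       # the unaffected part alone uses the budget
    return affected_s / remaining


print(required_factor(100, 80, 4))  # multiply must become 16 times faster
print(required_factor(100, 80, 5))  # None: a fivefold speedup is not achievable
```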
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How can I Make the Program Run Faster? 


Execution time = N × CPI × (1/f)

* Reduce the number of instructions (N)
  - Make instructions that 'do' more (CISC)
  - Use better compilers
* Use fewer cycles to perform each instruction (CPI)
  - Simpler instructions (RISC)
  - Use multiple units/ALUs/cores in parallel
* Increase the clock frequency (f)
  - Find a 'newer' technology to manufacture
  - Redesign time-critical components
  - Adopt pipelining
Homework # 1 


* Exercises in the textbook (Computer Organization & Design, by Patterson & Hennessy, 5th Edition):
  - 1.3
  - 1.5
  - 1.6
  - 1.7
  - 1.5
  - 1.10
  - 1.14
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Chapter 2 
Instructions (Language of the Computer) 


The words of a computer's language are called instructions, and its vocabulary is called 
an instruction set. 


The instruction set, also called the ISA (Instruction Set Architecture), is the part of a computer that pertains to programming, which is basically machine language.


The instruction set provides commands to the processor, to tell it what it needs to do.


The instruction set consists of addressing modes, instructions, data types, registers, memory architecture, interrupt and exception handling, and external I/O.


An instruction set can be built into the hardware of the processor, or it can be emulated 
in software, using an interpreter. The hardware design is more efficient and faster for 
running programs than the emulated software. 


Computer designers have a common goal: to find a language that makes it easy to build 
the hardware and the compiler while maximizing performance and minimizing cost and 
energy. 


How to classify ISA?
* Based on complexity
  - Complex Instruction Set Computer (CISC)
  - Reduced Instruction Set Computer (RISC)
* Parallelism / word size
  - VLIW (very long instruction word)
  - LIW (long instruction word)
  - EPIC (explicitly parallel instruction computing)
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RISC vs CISC 


* RISC (reduced instruction set computer) instructions
  - only load/store instructions access memory
  - operands (data) must be in registers to perform an operation
  - each instruction takes roughly the same amount of time
  - simple addressing modes

* CISC (complex instruction set computer) instructions
  - ALU instructions can access memory to fetch operands
  - load/store instructions access memory
  - some instructions take much longer to execute than others
  - complex addressing modes

Data can be stored in registers or memory locations. Memory access is slower (approximately 50 ns) than register access (approximately 1 ns or less).

To increase the speed of computation it pays to keep the variables in registers as long as possible. However, due to technology limitations, the number of registers is quite limited (typically 8-64).

Memory can be viewed as a bookshelf; registers can be viewed as spaces on your table.

MIPS has 32 registers, r0-r31.
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Program 








Memory and Registers
Addresses are 32 bits
- 2^32 different locations

Words are 32 bits, or 4 bytes
- 2^30 addressable words

Registers: $0, $1, $2, ..., $31
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e Word address must be aligned 


Registers are also word-sized
* Only 32 general purpose
* Special: PC, Status, HI, LO, ...
* Floating point registers, ...


MIPS registers

Name       Number         Description
$zero      0              constant 0
$at        1              assembler temporary (reserved for assembler use); do not use
$v0-$v1    2-3            function select; return result (procedure results)
$a0-$a3    4-7            function (procedure) arguments
$t0-$t9    8-15, 24-25    temporaries
$s0-$s7    16-23          saved registers (saved across procedure calls)
$k0-$k1    26-27          kernel registers (reserved for OS kernel); do not use
$gp        28             global pointer
$sp        29             stack pointer
$fp        30             frame pointer
$ra        31             return address


Registers Used in This Chapter

Number     Name       Use
$8-$15     $t0-$t7    temporary registers (10 in total, together with $t8-$t9)
$16-$23    $s0-$s7    operands saved across procedure calls
$24-$25    $t8-$t9    more temporaries
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Assembly language programs 


What is an Assembler?

An assembler is a simple piece of software that translates assembly language into machine language.

Assembly language:

lw  $t0, 32($s3)
add $s1, $s2, $t0





Register Operands
* Ex 1
  - C code: f = (g + h) - (i + j)
  - MIPS code:

add $t0, $s1, $s2    # $t0 = $s1 + $s2
add $t1, $s3, $s4    # $t1 = $s3 + $s4
sub $s0, $t0, $t1    # $s0 = $t0 - $t1

* Ex 2
add $t0, $s1, $zero  # use the zero register to move between registers
* Ex 3
addi $s1, $s2, 4     # immediate operand
* Ex 4
addi $s1, $s2, -1    # there is no subtract-immediate instruction; use a negative constant

Note: "sub" has a similar format.


Instruction Formats

High-level language statement:   a = b + c
Assembly language instruction:   add $t8, $s2, $s1
Machine language instruction:    000000 10010 10001 11000 00000 100000

Bits      Meaning
000000    ALU-type instruction
10010     register 18
10001     register 17
11000     register 24
00000     unused
100000    addition opcode

[Figure: steps in executing the instruction: instruction fetch (instruction cache), register readout (register file), operation (ALU), data read/store (data cache, not used), register writeback (register file, $24).]

Memory Operands
* Main memory is used for composite data
  - Arrays, structures, dynamic data
* To apply arithmetic operations
  - Load values from memory into registers
  - Store the result from a register to memory
* Load word has the destination first; store word has the destination last
* Memory is byte addressed
  - Each address identifies an 8-bit byte (the unit addressed by a load/store is a byte), i.e. we load words but address bytes
  - 2^32 bytes with byte addresses from 0 to 2^32 - 1
  - 2^30 words with byte addresses 0, 4, 8, ..., 2^32 - 4
* Words are aligned in memory
  - Address must be a multiple of 4
* In the case of a 32-bit word length, natural word boundaries occur at addresses 0, 4, 8, ..., as shown. We say that the word locations have aligned addresses.


[Figure: memory drawn as a column of 32-bit words (word 0, word 1, ...), each holding 32 bits of data, at byte addresses 0, 4, 8, 12, ...]

* Byte order (Big Endian and Little Endian)
  - Big Endian byte order: the most significant byte (the "big end") of the data is placed at the byte with the lowest address. The rest of the data is placed in order in the next three bytes in memory (MIPS).
  - Little Endian byte order: the least significant byte (the "little end") of the data is placed at the byte with the lowest address. The rest of the data is placed in order in the next three bytes in memory (Intel 80x86).


Example
* In C, int num = 0x12345678; // a 32-bit word
* How is num stored in memory?

Byte address   Big Endian   Little Endian
4n             12           78
4n+1           34           56
4n+2           56           34
4n+3           78           12

* MIPS is Big Endian, so the bytes in a word are numbered as follows:
  - byte 0 is the leftmost (most significant) byte and byte 3 is the rightmost (least significant) byte.
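The two byte orders can be observed with Python's standard `struct` module, which packs an integer into bytes in an explicitly chosen order. This is an illustrative sketch only; it shows byte layout in general, not the behavior of a MIPS simulator.

```python
import struct

num = 0x12345678  # the 32-bit word from the example

big = struct.pack(">I", num)     # ">" selects big-endian, "I" is an unsigned 32-bit int
little = struct.pack("<I", num)  # "<" selects little-endian

# Index 0 corresponds to the lowest byte address (4n), index 3 to 4n+3.
for offset in range(4):
    print(f"4n+{offset}: big={big[offset]:02x} little={little[offset]:02x}")
```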





[Figure: memory as an array of 4-byte words (word 0, word 1, ...); up to 2^32 bytes = 2^30 words.]

[Figure: MIPS organization. The execution and integer unit (main processor) contains the 32 general-purpose registers $0-$31, the arithmetic and logic unit and the integer multiplier/divider; the floating-point unit (coprocessor 1) contains 32 floating-point registers and the floating-point arithmetic unit; the trap and memory unit (coprocessor 0) contains registers such as BadVaddr and Status.]
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Memory Alignment 


Memory is typically aligned on a word or double-word boundary.


An access to an object of size S bytes at byte address A is called aligned if A mod S = 0.


Access to an unaligned operand may require more memory accesses.
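The alignment rule A mod S = 0 is easy to test in code. A minimal sketch with the illustrative helper `is_aligned`, checking the first few byte addresses against 1-, 2-, 4- and 8-byte accesses:

```python
def is_aligned(address, size_bytes):
    """An access of size S at address A is aligned if A mod S == 0."""
    return address % size_bytes == 0


for address in range(9):
    row = ["Aligned" if is_aligned(address, size) else "Non aligned"
           for size in (1, 2, 4, 8)]
    print(f"0x{address:08x}", *row, sep="  ")
```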


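Memory Alignment on different architectures

Memory address   Alignment (8 bit)   Alignment (16 bit)   Alignment (32 bit)   Alignment (64 bit)
0x0000_0000      Aligned             Aligned              Aligned              Aligned
0x0000_0001      Aligned             Non Aligned          Non Aligned          Non Aligned
0x0000_0002      Aligned             Aligned              Non Aligned          Non Aligned
0x0000_0003      Aligned             Non Aligned          Non Aligned          Non Aligned
0x0000_0004      Aligned             Aligned              Aligned              Non Aligned
0x0000_0005      Aligned             Non Aligned          Non Aligned          Non Aligned
0x0000_0006      Aligned             Aligned              Non Aligned          Non Aligned
0x0000_0007      Aligned             Non Aligned          Non Aligned          Non Aligned
0x0000_0008      Aligned             Aligned              Aligned              Aligned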




[Figure: the four bytes of a word, numbered 0 (LSB) to 3 (MSB), laid out against increasing byte address for little-endian and big-endian byte order, with examples of a word-aligned word at byte address 4, a halfword-aligned word at byte address 2, and a byte-aligned (non-aligned) word at byte address 1.]

[Figure: simulator data-segment display starting at address 0x10010000, with byte addresses 0x10010000, 0x10010004, 0x10010018, 0x1001001a, 0x1001002c and 0x1001002e marked. Caption: "These are aligned addresses for lw."]

Example: Loads and Stores

Initially $s0 = 10010000 (hex)

Before:
Address     Contents (hex)
10010000    7C0802A6
10010004    BE81FFD0

Assembly code:

lw $t0, ($s0)
lw $t1, 4($s0)
sw $t0, 4($s0)
sw $t1, ($s0)

After:
Address     Contents (hex)
10010000    BE81FFD0
10010004    7C0802A6

$t0 = 7C0802A6, $t1 = BE81FFD0
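The effect of the four instructions can be traced with a toy model in which memory and the register file are plain dictionaries. This is a sketch, not a MIPS simulator: `lw` and `sw` here are hypothetical Python helpers that only mimic base-plus-offset addressing.

```python
memory = {0x10010000: 0x7C0802A6, 0x10010004: 0xBE81FFD0}
regs = {"$s0": 0x10010000, "$t0": 0, "$t1": 0}


def lw(rt, offset, rs):
    """rt = MEMORY[rs + offset]"""
    regs[rt] = memory[regs[rs] + offset]


def sw(rt, offset, rs):
    """MEMORY[rs + offset] = rt"""
    memory[regs[rs] + offset] = regs[rt]


lw("$t0", 0, "$s0")
lw("$t1", 4, "$s0")
sw("$t0", 4, "$s0")
sw("$t1", 0, "$s0")

for address, value in sorted(memory.items()):
    print(f"{address:08X}: {value:08X}")
```

The two loads followed by the two crossed stores swap the contents of the two words.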











Example
- C code:

g = h + A[8];

- g in $s1, h in $s2, base address of A in $s3 ($s3 is the base register)

- Compiled MIPS code:

- Index 8 requires an offset of 32

- 4 bytes per word, so the offset of element i of array A is 4i

lw  $t0, 32($s3)    # load word
add $s1, $s2, $t0
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Example 
C code:
A[12] = h + A[8];
- h in $s2, base address of A in $s3
Compiled MIPS code:
- Index 8 requires an offset of 32; index 12 requires an offset of 48

lw  $t0, 32($s3)    # load word
add $t0, $s2, $t0
sw  $t0, 48($s3)    # store word

Note

MIPS register 0 ($zero) is the constant 0
* Cannot be overwritten
Useful for common operations
* E.g., move between registers
add $t2, $s1, $zero
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Binary Representation of Integers 





A number can be represented in any base

- Hexadecimal/binary/decimal representations
  ACE7 (hex) = 1010 1100 1110 0111 (binary) = 44263 (decimal)
  * most significant bit, MSB: usually the leftmost bit
  * least significant bit, LSB: usually the rightmost bit
- Ideally, we could represent any integer if the bit width were unlimited

Unsigned Binary Integers
- Given an n-bit number

\[ x = x_{n-1} 2^{n-1} + x_{n-2} 2^{n-2} + \cdots + x_{1} 2^{1} + x_{0} 2^{0} \]

- Range: 0 to +2^n - 1
- Example
  0000 0000 0000 0000 0000 0000 0000 1011 (binary)
  = 0 + ... + 1×2^3 + 0×2^2 + 1×2^1 + 1×2^0
  = 0 + ... + 8 + 0 + 2 + 1 = 11 (decimal)
- for an 8-bit byte: 0 to 255 (0 to 2^8 - 1)
- for a 16-bit halfword: 0 to 65,535 (0 to 2^16 - 1)
- for a 32-bit word: 0 to 4,294,967,295 (0 to 2^32 - 1)
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2s-Complement Signed Integers 


Given an n-bit number

\[ x = -x_{n-1} 2^{n-1} + x_{n-2} 2^{n-2} + \cdots + x_{1} 2^{1} + x_{0} 2^{0} \]

Range: -2^(n-1) to +2^(n-1) - 1
Example
1111 1111 1111 1111 1111 1111 1111 1100 (binary)
= -1×2^31 + 1×2^30 + ... + 1×2^2 + 0×2^1 + 0×2^0
= -2,147,483,648 + 2,147,483,644 = -4 (decimal)

Using 32 bits
-2,147,483,648 to +2,147,483,647


Sign Extension

Representing a number using more bits
* Preserve the numeric value
In the MIPS instruction set
* addi: extend immediate value
* lb, lh: extend loaded byte/halfword
* beq, bne: extend the displacement
Replicate the sign bit to the left
* c.f. unsigned values: extend with 0s
Examples: 8-bit to 16-bit
* +2: 0000 0010 => 0000 0000 0000 0010
* -2: 1111 1110 => 1111 1111 1111 1110

The MSB implicitly serves as the sign bit

The 2's complement of 1000 0000 is 1000 0000; this number is defined as -128

If the bit width is n
* range: -2^(n-1) to 2^(n-1) - 1; 2^n different numbers
* e.g., for a byte: -128 to 127

Relatively easy hardware design
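The two's-complement interpretation and sign extension can be sketched with plain integer arithmetic. `to_signed` and `sign_extend` are illustrative helper names; bit patterns are held as non-negative Python integers.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):      # sign bit set: subtract 2^bits
        return value - (1 << bits)
    return value


def sign_extend(value, from_bits, to_bits):
    """Replicate the sign bit of a `from_bits` pattern into a `to_bits` pattern."""
    return to_signed(value, from_bits) & ((1 << to_bits) - 1)


print(to_signed(0xFFFFFFFC, 32))                       # the 32-bit example word
print(f"{sign_extend(0b00000010, 8, 16):016b}")        # +2 extended to 16 bits
print(f"{sign_extend(0b11111110, 8, 16):016b}")        # -2 extended to 16 bits
```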
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MIPS Instruction Formats 


* All MIPS instructions are encoded in binary.
* All MIPS instructions are 32 bits long.
* All instructions have an op (or opcode): the operation code, which specifies the operation (first 6 bits).

Instruction Format

General syntax

Three operands       Two operands
op dst, src, src     op dst, src       op
op dst, src, imm     op dst, imm       op src

op    operation code, or mnemonic
dst   destination register
src   source register
imm   immediate value (16-bit), encoded in the instruction

* There are three instruction categories:

                  6        5        5        5        5         6
              +--------+--------+--------+--------+--------+------------+
R-type format | Op-code|   Rs   |   Rt   |   Rd   |   SA   | Funct-code |
              +--------+--------+--------+--------+--------+------------+
                  6        5        5                  16
              +--------+--------+--------+------------------------------+
I-type format | Op-code|   Rs   |   Rt   |   2's complement constant    |
              +--------+--------+--------+------------------------------+
                  6                          26
              +--------+------------------------------------------------+
J-type format | Op-code|                  jump target                   |
              +--------+------------------------------------------------+
               bit 31                                               bit 0

MIPS
* Reduced Instruction Set Computer (RISC)
* about 200 instructions, 32 bits each, 3 formats
* all operands in registers
  - registers are 32 bits each
* essentially 1 addressing mode: Mem[reg + imm]

x86
* Complex Instruction Set Computer (CISC)
* > 1000 instructions, 1 to 15 bytes each
* operands in dedicated registers, general-purpose registers, memory
  - can be 1, 2, 4, 8 bytes, signed or unsigned
* multiple addressing modes, e.g. Mem[segment + reg + reg*scale + offset]

MIPS instruction types
* R-type
  - Uses three register operands
  - Used by all arithmetic and logical instructions
* I-type
  - Uses two register operands and an address/immediate value
  - Used by load and store instructions
  - Used by arithmetic and logical instructions with a constant
* J-type
  - Contains a jump address
  - Used by jump instructions


* R-format (R for aRithmetic)

| op     | rs     | rt     | rd     | shamt  | funct  |
  6 bits   5 bits   5 bits   5 bits   5 bits   6 bits

* Instruction fields
  - op: operation code (opcode); R-format instructions have op = 0
  - rs: first source register number
  - rt: second source register number
  - rd: destination register number
  - shamt: shift amount (how many positions to shift)
  - funct: function code (extends opcode)

* add $t0, $s1, $s2

  op       rs      rt      rd      shamt   funct
  000000 | 10001 | 10010 | 01000 | 00000 | 100000

  00000010001100100100000000100000 (binary) = 02324020 (hex)
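The field packing above can be reproduced by shifting each field to its bit position and OR-ing the results. In this sketch, `encode_r` and the small `REG` table are illustrative; only the three registers used in the example are listed.

```python
# Register numbers for the registers used in the example.
REG = {"$t0": 8, "$s1": 17, "$s2": 18}


def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack the six R-format fields (6, 5, 5, 5, 5, 6 bits) into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# add $t0, $s1, $s2: rd = $t0, rs = $s1, rt = $s2, funct = 0x20
word = encode_r(0, REG["$s1"], REG["$s2"], REG["$t0"], 0, 0x20)
print(f"{word:032b}")
print(f"{word:08x}")
```

Note that the assembly operand order (rd, rs, rt) differs from the field order in the machine word (rs, rt, rd).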
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Integer Add /Subtract Instructions 


Instruction            Meaning              R-Type Format
add  $s1, $s2, $s3     $s1 = $s2 + $s3      op = 0 | rs = $s2 | rt = $s3 | rd = $s1 | sa = 0 | f = 0x20
addu $s1, $s2, $s3     $s1 = $s2 + $s3      op = 0 | rs = $s2 | rt = $s3 | rd = $s1 | sa = 0 | f = 0x21
sub  $s1, $s2, $s3     $s1 = $s2 - $s3      op = 0 | rs = $s2 | rt = $s3 | rd = $s1 | sa = 0 | f = 0x22
subu $s1, $s2, $s3     $s1 = $s2 - $s3      op = 0 | rs = $s2 | rt = $s3 | rd = $s1 | sa = 0 | f = 0x23

* add & sub: overflow causes an arithmetic exception
  - In case of overflow, the result is not written to the destination register
* addu & subu: same operation as add & sub
  - However, no arithmetic exception can occur
  - Overflow is ignored
Example

* Consider the translation of: f = (g+h) - (i+j)
* The compiler allocates registers to variables
  - Assume that f, g, h, i, and j are allocated registers $s0 through $s4
  - Called the saved registers: $s0 = $16, $s1 = $17, ..., $s7 = $23
* Translation of: f = (g+h) - (i+j)

addu $t0, $s1, $s2    # $t0 = g + h
addu $t1, $s3, $s4    # $t1 = i + j
subu $s0, $t0, $t1    # f = (g+h)-(i+j)

  - Temporary results are stored in $t0 = $8 and $t1 = $9

* Translate: addu $t0,$s1,$s2 to binary code
* Solution:

  op       rs = $s1   rt = $s2   rd = $t0   sa      func
  000000   10001      10010      01000      00000   100001

Range, Carry, Borrow, and Overflow


* Bits have no inherent meaning. The same n bits stored in a register can represent an unsigned or a signed integer.

* Unsigned integers: n-bit representation
  - min = 0, max = 2^n - 1
  - Numbers < 0: Borrow = 1 (subtraction)
  - Numbers > max: Carry = 1 (addition)

* Signed integers: n-bit 2's complement representation
  - min = -2^(n-1), max = 2^(n-1) - 1
  - Numbers < min: negative overflow
  - Numbers > max: positive overflow

Carry and Overflow
* Carry is useful when adding (subtracting) unsigned integers
  - Carry indicates that the unsigned sum is out of range
* Overflow is useful when adding (subtracting) signed integers
  - Overflow indicates that the signed sum is out of range
* Range for 32-bit unsigned integers = 0 to (2^32 - 1)
* Range for 32-bit signed integers = -2^31 to (2^31 - 1)
* Example 1: Carry = 1, Overflow = 0 (no overflow)

    1000 0100 0000 0000 1110 0001 0100 0001
  + 1111 1111 0000 0000 1111 0101 0010 0000
  -----------------------------------------
    1000 0011 0000 0001 1101 0110 0110 0001

  The unsigned sum is out of range, but the signed sum is correct

* Example 2: Carry = 0, Overflow = 1

    0010 0100 0000 0100 1011 0001 0100 0100
  + 0111 1111 0111 0000 0011 0101 0000 0010
  -----------------------------------------
    1010 0011 0111 0100 1110 0110 0100 0110

  The unsigned sum is correct, but the signed sum is out of range

* Example 3: Carry = 1, Overflow = 1

    1000 0100 0000 0100 1011 0001 0100 0100
  + 1001 1111 0111 0000 0011 0101 0000 0010
  -----------------------------------------
    0010 0011 0111 0100 1110 0110 0100 0110

  Both the unsigned and signed sums are out of range
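The three examples can be reproduced with a small 32-bit adder model. This is a sketch: `add32` is an illustrative helper that returns the truncated sum together with the carry-out and the signed-overflow flag, where overflow is detected as "both operands have the same sign and the result has a different one".

```python
MASK32 = 0xFFFFFFFF


def add32(a, b):
    """Add two 32-bit patterns; return (result, carry, overflow)."""
    total = a + b
    result = total & MASK32
    carry = total > MASK32                        # unsigned sum out of range
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    overflow = sign_a == sign_b and sign_r != sign_a   # signed sum out of range
    return result, carry, overflow


examples = [
    (0x8400E141, 0xFF00F520),  # Example 1
    (0x2404B144, 0x7F703502),  # Example 2
    (0x8404B144, 0x9F703502),  # Example 3
]
for a, b in examples:
    result, carry, overflow = add32(a, b)
    print(f"{a:08X} + {b:08X} = {result:08X}  carry={int(carry)} overflow={int(overflow)}")
```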
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Integer Multiplication & Division 


* Consider a×b and a/b where a and b are in $s1 and $s2
  - Signed multiplication: mult $s1,$s2
  - Unsigned multiplication: multu $s1,$s2
  - Signed division: div $s1,$s2
  - Unsigned division: divu $s1,$s2
* For multiplication, the result is 64 bits
  - LO = low-order 32 bits and HI = high-order 32 bits
* For division
  - LO = 32-bit quotient and HI = 32-bit remainder
  - If the divisor is 0 then the result is unpredictable
* Moving data
  - mflo rd (move from LO to rd), mfhi rd (move from HI to rd)
  - mtlo rs (move to LO from rs), mthi rs (move to HI from rs)

Instruction      Meaning
mult  rs, rt     HI, LO = rs × rt
multu rs, rt     HI, LO = rs × rt
div   rs, rt     LO = rs / rt, HI = rs mod rt
divu  rs, rt     LO = rs / rt, HI = rs mod rt

* Signed arithmetic: mult, div (rs and rt are signed)
  - LO = 32-bit low-order and HI = 32-bit high-order of multiplication
  - LO = 32-bit quotient and HI = 32-bit remainder of division
* Unsigned arithmetic: multu, divu (rs and rt are unsigned)
* No arithmetic exception can occur

I-format (I for Immediate)


| op     | rs     | rt     | constant or address |
  6 bits   5 bits   5 bits   16 bits

Immediate arithmetic and load/store instructions
* rt: destination or source register number
* Constant: -2^15 to +2^15 - 1
* Address: offset added to base address in rs

* Ex
  addi $s1, $s2, 100

  op       rs      rt      immediate
  8        18      17      100
  001000 | 10010 | 10001 | 0000000001100100
  6 bits   5 bits  5 bits  16 bits
Load and Store Word

* Load Word instruction (word = 4 bytes in MIPS)
  lw Rt, imm16(Rs)    # Rt = MEMORY[Rs+imm16]
* Store Word instruction
  sw Rt, imm16(Rs)    # MEMORY[Rs+imm16] = Rt
* Base or displacement addressing is used
  - Memory address = Rs (base) + immediate16 (displacement)
  - Immediate16 is sign-extended to give a signed displacement
Base or Displacement Addressing

* lw $t0, 1002($s2)

  op       rs      rt      offset
  35       18      8       1002
  100011 | 10010 | 01000 | 0000001111101010
  6 bits   5 bits  5 bits  16 bits


MIPS assembly language

Category        Instruction   Example            Meaning                    Comments
Data transfer   load word     lw $s1,100($s2)    $s1 = Memory[$s2 + 100]    Data from memory to register
                store word    sw $s1,100($s2)    Memory[$s2 + 100] = $s1    Data from register to memory

MIPS machine language

Name | Format | Example | Comments








Load and Store Byte and Halfword

* The MIPS processor supports the following data formats:
  - Byte = 8 bits, halfword = 16 bits, word = 32 bits
* Load & store instructions for bytes and halfwords
  - lb = load byte, lbu = load byte unsigned, sb = store byte
  - lh = load half, lhu = load half unsigned, sh = store halfword

[Figure: a loaded byte or halfword occupies the low-order bits of the 32-bit register; the upper bits are filled by sign extension (lb, lh) or zero extension (lbu, lhu).]
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Load and Store Instructions 


lb Rt, imm(Rs) | Rt ← MEM[Rs+imm] | 0x20 | Rs | Rt | 16-bit immediate
lw Rt, imm(Rs) | Rt ← MEM[Rs+imm] | 0x23 | Rs | Rt | 16-bit immediate

* Base / displacement addressing is used
  - Memory address = Rs (base) + immediate (displacement)
  - If Rs is $zero then address = immediate (absolute)
  - If immediate is 0 then address = Rs (register indirect)
Example

We want to load a BYTE into $s3 from the address 2000.
After the load, what is the value of $s3?

Address   Contents
1999
2000      1111 1111
2001      1111 1111
2002      1111 1111
2003      1111 1111

Assume $s0 = 2000

A1 (unsigned): lbu $s3, 0($s0)
    $s3 = 0000 0000 0000 0000 0000 0000 1111 1111 (255)

A2 (signed): lb $s3, 0($s0)
    $s3 = 1111 1111 1111 1111 1111 1111 1111 1111 (-1)
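The difference between lb and lbu is only in how the upper 24 bits of the register are filled. A minimal sketch, with `load_byte` as a hypothetical helper and memory modeled as a dictionary of byte values:

```python
def load_byte(memory, address, signed):
    """Return the 32-bit register pattern after lb (signed) or lbu (unsigned)."""
    byte = memory[address] & 0xFF
    if signed and byte & 0x80:
        return byte | 0xFFFFFF00   # sign-extend: copy bit 7 into bits 8..31
    return byte                    # zero-extend: upper 24 bits are 0


memory = {2000: 0b11111111}

lbu_result = load_byte(memory, 2000, signed=False)
lb_result = load_byte(memory, 2000, signed=True)
print(f"lbu: {lbu_result:032b}")
print(f"lb : {lb_result:032b}")
```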
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32-bit Constants 


* I-type instructions can have only 16-bit constants
* What if we want to load a 32-bit constant into a register?
* We can't have a 32-bit constant in I-type instructions
  - The sizes of all instructions are fixed to 32 bits
* Solution: use two instructions instead of one
  - Suppose we want: $t1 = 0xAC5165D9 (32-bit constant)
  - lui: load upper immediate

                               Upper 16 bits   Lower 16 bits
lui $t1, 0xAC51          $t1   0xAC51          0x0000
ori $t1, $t1, 0x65D9     $t1   0xAC51          0x65D9


32-Bit Immediate Operands 


Most constants are small, so a 16-bit immediate is sufficient. For the occasional 32-bit constant, the MIPS instruction set includes
the instruction load upper immediate (lui).


lui (load upper immediate): 


Transfers the 16-bit immediate constant field value into the leftmost 16 bits of the register
(upper halfword of the register), filling the lower 16 bits with 0s.

Ex:

lui $t0, 61 | 0000 0000 0011 1101 | 0000 0000 0000 0000

Ex: What is the MIPS assembly code to load this 32-bit constant into register $s0?

0000 0000 0011 1101 0000 1001 0000 0000

lui $s0, 61            # 61 decimal = 0000 0000 0011 1101 binary
                       # $s0 = 0000 0000 0011 1101 | 0000 0000 0000 0000
ori $s0, $s0, 2304     # 2304 decimal = 0000 1001 0000 0000 binary
                       # $s0 = 0000 0000 0011 1101 | 0000 1001 0000 0000
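The lui/ori pair can be modelled in a few lines of Python. This is a sketch under stated assumptions: lui, ori, and split_constant are hypothetical helper names that mimic the register effect of the two instructions, with registers treated as 32-bit unsigned values.

```python
MASK32 = 0xFFFFFFFF


def lui(imm16):
    """Model of lui: place the 16-bit immediate in the upper halfword, zero the lower."""
    return ((imm16 & 0xFFFF) << 16) & MASK32


def ori(rs, imm16):
    """Model of ori: OR the register with the zero-extended 16-bit immediate."""
    return (rs | (imm16 & 0xFFFF)) & MASK32


def split_constant(constant):
    """Split a 32-bit constant into the (upper, lower) immediates for lui and ori."""
    return (constant >> 16) & 0xFFFF, constant & 0xFFFF


upper, lower = split_constant(0xAC5165D9)
t1 = ori(lui(upper), lower)    # the $t1 example from the slide
s0 = ori(lui(61), 2304)        # the $s0 example from the slide
```

Because ori zero-extends its immediate, it cannot disturb the upper halfword that lui has just written.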
Logical Operations

(Instructions for bitwise manipulation)

Shift left
Shift right
Bitwise AND
Bitwise OR
Bitwise NOT

* Useful for extracting and inserting groups of bits in a word

* Shift Operations
  - Shift left logical
    - Shift left and fill with 0 bits
    - sll by i bits multiplies by 2^i
  - Shift right logical
    - Shift right and fill with 0 bits
    - srl by i bits divides by 2^i (unsigned only)

* Shift instruction format (R-format)

op | rs | rt | rd | shamt | funct
6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits

shamt: how many positions to shift

Logic Bitwise Operations

* Logic bitwise operations: and, or, xor, nor

x | y | x and y | x or y | x xor y | x nor y
0 | 0 | 0 | 0 | 0 | 1
0 | 1 | 0 | 1 | 1 | 0
1 | 0 | 0 | 1 | 1 | 0
1 | 1 | 1 | 1 | 0 | 0

* AND instruction is used to clear bits: x and 0 → 0
* OR instruction is used to set bits: x or 1 → 1
* XOR instruction is used to toggle bits: x xor 1 → not x
* NOT instruction is not needed, why?

  not $t1, $t2 is equivalent to: nor $t1, $t2, $t2
* AND Operations
  - Useful to mask bits in a word (clear some bits to 0)

* OR Operations
  - Useful to include bits in a word (set some bits to 1)

* NOT Operations
  - Useful to invert bits in a word: change 0 to 1, and 1 to 0
  - MIPS has a 3-operand NOR instruction
  - a NOR b == NOT (a OR b)

    nor $t0, $t1, $zero        # register 0 ($zero) is always read as zero

    $t1 | 0000 0000 0000 0000 0011 1100 0000 0000
    $t0 | 1111 1111 1111 1111 1100 0011 1111 1111

* sll: shift left logical; shift left and fill with 0 bits
    sll $s1, $s2, 10
* srl: shift right logical; shift right and fill with 0 bits
    srl $s1, $s2, 10
* and: AND operation; select some bits, clear others to 0
    and $s0, $s1, $s2
* or: OR operation; set some bits to 1, leave others unchanged
    or $s0, $s1, $s2
* not: NOT operation; change 0 to 1, and 1 to 0
    nor $t0, $t1, $zero        # a NOR b = NOT (a OR b)
Logical Bitwise Instructions

Instruction          Meaning                  R-Type Format
and $s1, $s2, $s3    $s1 = $s2 & $s3          op = 0, rs = $s2, rt = $s3, rd = $s1, sa = 0
or  $s1, $s2, $s3    $s1 = $s2 | $s3          op = 0, rs = $s2, rt = $s3, rd = $s1, sa = 0
xor $s1, $s2, $s3    $s1 = $s2 ^ $s3          op = 0, rs = $s2, rt = $s3, rd = $s1, sa = 0
nor $s1, $s2, $s3    $s1 = ~($s2 | $s3)       op = 0, rs = $s2, rt = $s3, rd = $s1, sa = 0

* Examples:

Assume $s1 = 0xabcd1234 and $s2 = 0xffff0000

and $s0,$s1,$s2    # $s0 = 0xabcd0000
or  $s0,$s1,$s2    # $s0 = 0xffff1234
xor $s0,$s1,$s2    # $s0 = 0x54321234
nor $s0,$s1,$s2    # $s0 = 0x0000edcb
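The four results above, and the use of masks to clear, set, and toggle bits, can be reproduced with a short Python sketch. The helper names (nor, clear_bits, set_bits, toggle_bits, bitwise_not) are illustrative assumptions; Python integers are unbounded, so results are masked to 32 bits to imitate a register.

```python
MASK32 = 0xFFFFFFFF


def nor(a, b):
    """Model of nor: NOT (a OR b), truncated to 32 bits."""
    return ~(a | b) & MASK32


def clear_bits(x, mask):
    """AND with the inverted mask clears the selected bits."""
    return x & ~mask & MASK32


def set_bits(x, mask):
    """OR with the mask sets the selected bits."""
    return (x | mask) & MASK32


def toggle_bits(x, mask):
    """XOR with the mask toggles the selected bits."""
    return (x ^ mask) & MASK32


def bitwise_not(x):
    """NOT built from NOR with zero, as in nor $t0, $t1, $zero."""
    return nor(x, 0)


s1, s2 = 0xabcd1234, 0xffff0000
results = {"and": s1 & s2, "or": s1 | s2, "xor": s1 ^ s2, "nor": nor(s1, s2)}
```

The bitwise_not helper shows why a separate NOT instruction is unnecessary: NOR with the zero register already inverts every bit.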


Shift Operations
* Shifting moves all the bits in a register left or right
* Shifts by a constant amount: sll, srl, sra
  - sll/srl mean shift left/right logical by a constant amount
  - The 5-bit shift amount field is used by these instructions
  - sra means shift right arithmetic by a constant amount
  - The sign bit (rather than 0) is shifted in from the left

[Figure: a 32-bit register under sll (0s shifted in from the right), srl (0s shifted in from the left, LSB shifted out), and sra (sign bit shifted in from the left, LSB shifted out).]
Shift Instructions

Instruction           Meaning
sll  $t1,$t2,10       $t1 = $t2 << 10
srl  $t1,$t2,10       $t1 = $t2 >>> 10
sra  $t1,$t2,10       $t1 = $t2 >> 10
sllv $t1,$t2,$t3      $t1 = $t2 << $t3
srlv $t1,$t2,$t3      $t1 = $t2 >>> $t3
srav $t1,$t2,$t3      $t1 = $t2 >> $t3

(Here >>> denotes a logical right shift and >> an arithmetic right shift.)

* sll, srl, sra: shift by a constant amount
  - The shift amount (sa) field specifies a number between 0 and 31
* sllv, srlv, srav: shift by a variable amount
  - A source register specifies the variable shift amount between 0 and 31

Examples
* Given that $t2 = 0xabcd1234 and $t3 = 16

sll  $t1, $t2, 8       # $t1 = 0xcd123400
srl  $t1, $t2, 4       # $t1 = 0x0abcd123
sra  $t1, $t2, 4       # $t1 = 0xfabcd123
srlv $t1, $t2, $t3     # $t1 = 0x0000abcd

Encoding of srlv $t1, $t2, $t3:

op = 0 | rs = $t3 | rt = $t2 | rd = $t1 | sa = 0 | srlv
000000 | 01011 | 01010 | 01001 | 00000 | 000110
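The shift examples can be verified with a Python sketch. The functions sll, srl, and sra are hypothetical models of the instructions on 32-bit values, with the shift amount limited to 0..31 as in the sa field (or the low 5 bits of the source register for the variable forms).

```python
MASK32 = 0xFFFFFFFF


def sll(value, amount):
    """Shift left logical: fill with 0s from the right, drop bits above bit 31."""
    return (value << (amount & 31)) & MASK32


def srl(value, amount):
    """Shift right logical: fill with 0s from the left."""
    return (value & MASK32) >> (amount & 31)


def sra(value, amount):
    """Shift right arithmetic: fill with copies of the sign bit from the left."""
    value &= MASK32
    if value & 0x80000000:          # negative in two's complement
        value -= 1 << 32            # let Python's >> replicate the sign
    return (value >> (amount & 31)) & MASK32


t2, t3 = 0xabcd1234, 16
shifted = [sll(t2, 8), srl(t2, 4), sra(t2, 4), srl(t2, t3)]
```

The only difference between srl and sra is what enters from the left, which matters only when bit 31 of the source is 1.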
* Example: multiply $s1 by 36
  - Factor 36 into (4 + 32) and use the distributive property of multiplication
  - $s2 = $s1*36 = $s1*(4 + 32) = $s1*4 + $s1*32

sll  $t0, $s1, 2       # $t0 = $s1 * 4
sll  $t1, $s1, 5       # $t1 = $s1 * 32
addu $s2, $t0, $t1     # $s2 = $s1 * 36
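The same shift-and-add sequence, written as a Python sketch. The function name multiply_by_36 is an assumption for illustration; results wrap modulo 2^32, matching the behavior of addu, which ignores overflow.

```python
MASK32 = 0xFFFFFFFF


def multiply_by_36(s1):
    """Multiply by 36 using two shifts and one add, as in the sll/sll/addu sequence."""
    t0 = (s1 << 2) & MASK32     # sll  $t0, $s1, 2   -> s1 * 4
    t1 = (s1 << 5) & MASK32     # sll  $t1, $s1, 5   -> s1 * 32
    return (t0 + t1) & MASK32   # addu $s2, $t0, $t1 -> s1 * 36


product = multiply_by_36(7)
```

The factoring works for any constant that is a sum of a few powers of two; each power costs one shift.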





Note: Logical AND immediate (andi) and logical OR immediate (ori) put 0s into the upper 16 bits to
form a 32-bit constant, unlike add immediate (addi), which does sign extension.


MIPS assembly language

Category   Instruction           Example              Meaning                 Comments
Logical    and                   and  $s1,$s2,$s3     $s1 = $s2 & $s3         Three reg. operands; bit-by-bit AND
           or                    or   $s1,$s2,$s3     $s1 = $s2 | $s3         Three reg. operands; bit-by-bit OR
           nor                   nor  $s1,$s2,$s3     $s1 = ~($s2 | $s3)      Three reg. operands; bit-by-bit NOR
           and immediate         andi $s1,$s2,100     $s1 = $s2 & 100         Bit-by-bit AND reg with constant
           or immediate          ori  $s1,$s2,100     $s1 = $s2 | 100         Bit-by-bit OR reg with constant
           shift left logical    sll  $s1,$s2,10      $s1 = $s2 << 10         Shift left by constant
           shift right logical   srl  $s1,$s2,10      $s1 = $s2 >> 10         Shift right by constant

MIPS machine language

Name   Format   Example (fields in decimal)    Comments
and    R        0   18  19  17  0   36         and  $s1,$s2,$s3
or     R        0   18  19  17  0   37         or   $s1,$s2,$s3
nor    R        0   18  19  17  0   39         nor  $s1,$s2,$s3
andi   I        12  18  17  100                andi $s1,$s2,100
ori    I        13  18  17  100                ori  $s1,$s2,100
sll    R        0   0   18  17  10  0          sll  $s1,$s2,10
srl    R        0   0   18  17  10  2          srl  $s1,$s2,10
I-Type ALU Instructions

Instruction           Meaning
addi $t1, $t2, 25     $t1 = $t2 + 25
andi $t1, $t2, 25     $t1 = $t2 & 25

* addi: overflow causes an arithmetic exception
  - In case of overflow, the result is not written to the destination register
* addiu: same operation as addi, but overflow is ignored
* The immediate constant for addi and addiu is signed
  - No need for subi or subiu instructions
* The immediate constant for andi, ori, xori is unsigned


Examples: I-Type ALU Instructions

* Examples: assume A, B, C are allocated $s0, $s1, $s2

A = B+5;      translated as   addiu $s0,$s1,5
C = B-1;      translated as   addiu $s2,$s1,-1

A = B&0xf;    translated as   andi  $s0,$s1,0xf
C = B|0xf;    translated as   ori   $s2,$s1,0xf
C = 5;        translated as   ori   $s2,$zero,5
A = B;        translated as   ori   $s0,$s1,0
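The signed versus unsigned treatment of the 16-bit immediate can be made concrete with a Python sketch. The names sign_extend16, zero_extend16, addiu, andi, and ori are illustrative models, not a simulator API, and registers are treated as 32-bit unsigned patterns.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm):
    """Extend a 16-bit immediate to 32 bits by replicating bit 15 (addi, addiu)."""
    imm &= 0xFFFF
    if imm & 0x8000:
        return imm | 0xFFFF0000
    return imm


def zero_extend16(imm):
    """Extend a 16-bit immediate to 32 bits with 0s (andi, ori, xori)."""
    return imm & 0xFFFF


def addiu(rs, imm):
    return (rs + sign_extend16(imm)) & MASK32


def andi(rs, imm):
    return rs & zero_extend16(imm)


def ori(rs, imm):
    return (rs | zero_extend16(imm)) & MASK32


b = 0x1234                 # hypothetical value of B in $s1
a_plus = addiu(b, 5)       # A = B + 5
c_minus = addiu(b, -1)     # C = B - 1, no subi needed
a_copy = ori(b, 0)         # A = B
```

Adding a sign-extended negative immediate is what makes a separate subtract-immediate instruction unnecessary.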
http://www.mipsim.tk/download.aspx
(MIPS) Microprocessor without Interlocked Pipeline Stages

[Screenshot: MIPSim with Source1.asm open. The source below is assembled to MIPS machine code and reported as successfully executed: 8 instructions, of which 4 are arithmetic and 4 are logic instructions.]

addi  $s0,$s1,5
addiu $s0,$s1,5
addi  $s2,$s1,-1
addiu $s2,$s1,-1
andi  $s0,$s1,0xf
ori   $s2,$s1,0xf
ori   $s2,$zero,5
ori   $s0,$s1,0
Control Flow

Branch and Jump Instructions

Decision-making instructions
- alter the control flow, i.e., change the "next" instruction to be executed

Branch classifications
- Unconditional branch
  Always jump to the desired (specified) address
- Conditional branch
  Only jump to the desired (specified) address if the condition is true;
  otherwise, continue to execute the next instruction

Destination addresses can be specified in the same way as other
operands (combination of register, immediate constant, and memory
location), depending on what addressing modes are supported in the ISA

Control Instructions

Used when the next instruction to execute is not the one at the next sequential PC value.

Transfer control to another part of the instruction space.

Two groups of instructions:
* branches
  - conditional transfers of control
  - the target address is close to the current PC location
  - branch distance from the incremented PC value fits into the immediate field
  - for example: loops, if statements
* jumps
  - unconditional transfers of control
  - the target address is far away from the current PC location
  - for example: subroutine calls
* The jump instruction is of the J-type format:

  op = 2 | 26-bit address

* The jump instruction modifies the program counter PC:

  [Figure: the new PC is formed from the upper 4 bits of the current PC, the 26-bit address field, and two low-order 0 bits.]

* The upper 4 bits of the PC are unchanged

J-format (J for Jump)

op | address
6 bits | 26 bits (word-relative addressing)

Ex: 25 words = 100 bytes

j Label        # addr. Label = 100 (i.e., 100/4 = 25)

000010 | 00000000000000000000011001
6 bits | 26 bits (word-relative addressing)

Jump Instruction

* Jump (j and jal) targets could be anywhere in the program
  - Encode the "full" address in the instruction

op | address
6 bits | 26 bits

* (Pseudo)Direct jump addressing
  - Target address = PC[31..28] : (address × 4)

* Jump register (jr)
  - Copies register to PC
[Figure: J format. op (6 bits) = 2 for j, followed by the 26-bit jump target address. The effective target address (32 bits) takes its upper 4 bits from the incremented PC, then the 26-bit jump target address, then two 0 bits.]

[Figure: R format for jr rs. op = 000000 (ALU instruction), rs = source register, rt, rd, and sh unused (00000), fn = 001000 (jr = 8).]

Target field of jump instruction

jump instruction       000010 10110001010001010001100010
PC                     0101 0110011101100111001010010110
32-bit jump address    0101 10110001010001010001100010 00

Copy the high-order four bits from the PC, append the 26-bit target field
from the jump instruction, and shift the target field left two positions.

* jump: j (J-type)
* jump register: jr (R-type)
* jump and link: jal (J-type)

j and jal are the only two J-type instructions.

Instruction | Opcode | Target
j label | 000010 | coded address of label
jal label | 000011 | coded address of label
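The pseudo-direct address computation can be sketched in Python. The function names jump_target and encode_j are assumptions made for this illustration; the first argument stands for the incremented PC of the jump instruction.

```python
def jump_target(incremented_pc, instruction):
    """Form the 32-bit jump address: PC[31..28] : 26-bit target field : 00."""
    target_field = instruction & 0x03FFFFFF           # low 26 bits of the instruction
    return (incremented_pc & 0xF0000000) | (target_field << 2)


def encode_j(target_address):
    """Encode j: opcode 2 in the top 6 bits, target word address in the low 26 bits."""
    return (2 << 26) | ((target_address >> 2) & 0x03FFFFFF)


instruction = 0b00001010110001010001010001100010      # jump instruction from the slide
pc = 0b01010110011101100111001010010110               # PC from the slide
address = jump_target(pc, instruction)
```

Because only 26 bits are encoded, a jump cannot leave the 256 MB region selected by the upper 4 bits of the PC; jr is used for that.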
MIPS Jump Instructions

Jump instructions: unconditional transfer of control

j target         # jump
    go to the specified target address

jr rs            # jump register
    go to the address stored in rs
    (called an indirect jump)

jal target       # jump and link
    go to the target address; save PC+4 in $ra

jalr rs, rd      # jump and link register
    go to the address stored in rs; rd = PC+4
    default rd is $ra


I-type Format for Branches

I-type format used for conditional branches

opcode | rs | rt | immed
bits 31-26 | bits 25-21 | bits 20-16 | bits 15-0

* opcode = control instruction
* rs, rt = source operands
* immed = address offset in words, ±2^15
  - hardware sign-extends when used (replicates the msb)
  - target address = incremented PC + (immed*4)
MIPS Conditional Branch Instructions

* MIPS compare and branch instructions:

beq Rs, Rt, label     if (Rs == Rt) branch to label
bne Rs, Rt, label     if (Rs != Rt) branch to label

* MIPS compare to zero & branch instructions:
  Compare to zero is used frequently and implemented efficiently

bltz Rs, label        if (Rs < 0) branch to label
bgtz Rs, label        if (Rs > 0) branch to label
blez Rs, label        if (Rs <= 0) branch to label
bgez Rs, label        if (Rs >= 0) branch to label

* beqz and bnez are defined as pseudo-instructions.

Branch Instruction Format

* Branch instructions are of the I-type format:

Instruction            I-Type Format
beq  Rs, Rt, label     op = 4 | Rs | Rt | 16-bit offset
bne  Rs, Rt, label     op = 5 | Rs | Rt | 16-bit offset
bgtz Rs, label         op = 7 | Rs | 0  | 16-bit offset
bltz Rs, label         op = 1 | Rs | 0  | 16-bit offset
bgez Rs, label         op = 1 | Rs | 1  | 16-bit offset

* The branch instructions modify the PC register only
* PC-relative addressing:

  if (branch is taken) PC = PC + 4 + 4×offset else PC = PC + 4
Branch Distance

Extending the displacement of a branch target address
* Offset is a signed 16-bit offset
  - represents a number of instructions, not bytes
  - added to the incremented PC
* target address is a word address, not a byte address
* in assembly language, use a symbolic target address

Branching Far Away

* If the branch target is too far to encode with a
  16-bit offset, the assembler rewrites the code

Example, L1 too far:
        beq $s0,$s1, L1
Rewritten as:
        bne $s0,$s1, L2
        j   L1
    L2: ...
Translating an IF Statement
* Consider the following IF statement:
  if (a == b) c = d + e; else c = d - e;

  Given that a, b, c, d, e are in $t0 .. $t4 respectively

* How to translate the above IF statement?

        bne  $t0, $t1, else
        addu $t2, $t3, $t4
        j    next
else:   subu $t2, $t3, $t4
next:   . . .
Compare Instructions

* MIPS also provides set less than instructions

slt   Rd, Rs, Rt      if (Rs < Rt) Rd = 1 else Rd = 0
sltu  Rd, Rs, Rt      unsigned <
slti  Rt, Rs, imm     if (Rs < imm) Rt = 1 else Rt = 0
sltiu Rt, Rs, imm     unsigned <

Signed vs. Unsigned

Signed comparison: slt, slti
Unsigned comparison: sltu, sltiu

Example
$s0 = 1111 1111 1111 1111 1111 1111 1111 1111
$s1 = 0000 0000 0000 0000 0000 0000 0000 0001

slt  $t0, $s0, $s1     # signed
                       # (-1 < +1 ⇒ $t0 = 1)

sltu $t0, $s0, $s1     # unsigned
                       # (+4,294,967,295 > +1 ⇒ $t0 = 0)
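The two comparisons differ only in how the same 32 bits are interpreted, which a Python sketch makes explicit. The names slt, sltu, and as_signed32 are illustrative; they are not taken from any simulator.

```python
MASK32 = 0xFFFFFFFF


def as_signed32(pattern):
    """Read a 32-bit pattern as a two's-complement integer."""
    pattern &= MASK32
    return pattern - (1 << 32) if pattern & 0x80000000 else pattern


def slt(rs, rt):
    """Set on less than, signed interpretation."""
    return 1 if as_signed32(rs) < as_signed32(rt) else 0


def sltu(rs, rt):
    """Set on less than, unsigned interpretation."""
    return 1 if (rs & MASK32) < (rt & MASK32) else 0


s0, s1 = 0xFFFFFFFF, 0x00000001
signed_result = slt(s0, s1)       # -1 < +1
unsigned_result = sltu(s0, s1)    # 4,294,967,295 < 1 is false
```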
Compare Instruction Formats

Instruction           Meaning                      Format
slt  Rd, Rs, Rt       Rd = (Rs < Rt) ? 1 : 0       op = 0 | Rs | Rt | Rd | 0 | 0x2a
sltu Rd, Rs, Rt       Rd = (Rs < Rt) ? 1 : 0       op = 0 | Rs | Rt | Rd | 0 | 0x2b
slti Rt, Rs, imm      Rt = (Rs < imm) ? 1 : 0      0xa | Rs | Rt | 16-bit immediate

* The other comparisons are defined as pseudo-instructions:
  seq, sne, sgt, sgtu, sle, sleu, sge, sgeu

Pseudo-Instruction        Equivalent MIPS Instructions
sgt $t2, $t0, $t1         slt   $t2, $t1, $t0
seq $t2, $t0, $t1         subu  $t2, $t0, $t1
                          sltiu $t2, $t2, 1

* Can use slt, beq, bne, and the fixed value of 0
  in register $zero to create other conditions
  - less than:                 blt $s1, $s2, Label
  - less than or equal to:     ble $s1, $s2, Label
  - greater than:              bgt $s1, $s2, Label
  - greater than or equal to:  bge $s1, $s2, Label

* Such branches are included in the instruction set as pseudo-instructions
  - Recognized (and expanded) by the assembler
  - Reason why the assembler needs a reserved register ($at)
* Why not blt, bge, etc.?
* Hardware for <, >, ... is slower than for =, ≠
  - Combining with branch involves more work per instruction, requiring a slower clock
* beq and bne are the common case

Pseudo-Branch Instructions

* MIPS hardware does NOT provide the following instructions:

blt, bltu     branch if less than (signed / unsigned)
ble, bleu     branch if less or equal (signed / unsigned)
bgt, bgtu     branch if greater than (signed / unsigned)
bge, bgeu     branch if greater or equal (signed / unsigned)

* The MIPS assembler defines them as pseudo-instructions:

Pseudo-Instruction         Equivalent MIPS Instructions
blt $t0, $t1, label        slt $at, $t0, $t1
                           bne $at, $zero, label

$at ($1) is the assembler temporary register
Example of one-to-one pseudoinstruction: The following
    not $s0                  # complement ($s0)
is converted to the real instruction:
    nor $s0,$s0,$zero        # complement ($s0)

Example of one-to-several pseudoinstruction: The following
    abs $t0,$s0              # put |($s0)| into $t0
is converted to the sequence of real instructions:
    add $t0,$s0,$zero        # copy x into $t0
    slt $at,$t0,$zero        # is x negative?
    beq $at,$zero,+4         # if not, skip next instr
    sub $t0,$zero,$s0        # the result is 0 - x

Target Addressing Example

* Assume Loop at location 80000

Instruction                     Address    Encoded fields (decimal)
Loop: sll  $t1, $s3, 2          80000      0  | 0  | 19 | 9 | 2 | 0
      add  $t1, $t1, $s6        80004      0  | 9  | 22 | 9 | 0 | 32
      lw   $t0, 0($t1)          80008      35 | 9  | 8  | 0
      bne  $t0, $s5, Exit       80012      5  | 8  | 21 | 2
      addi $s3, $s3, 1          80016      8  | 19 | 19 | 1
      j    Loop                 80020      2  | 20000
Exit: ...                       80024
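The offset field of bne and the address field of j in this example follow directly from the instruction addresses. A Python sketch, with branch_offset and jump_field as assumed helper names:

```python
def branch_offset(branch_address, target_address):
    """Word offset stored in a branch: distance from the incremented PC, in instructions."""
    return (target_address - (branch_address + 4)) // 4


def jump_field(target_address):
    """26-bit field stored in a jump: the target byte address divided by 4."""
    return (target_address & 0x0FFFFFFF) >> 2


bne_offset = branch_offset(80012, 80024)   # bne at 80012 branches to Exit at 80024
j_field = jump_field(80000)                # j at 80020 jumps to Loop at 80000
```

A backward branch yields a negative offset, which is why the 16-bit field is sign-extended by the hardware.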
Branch on equal

beq rs, rt, label | 4 | rs | rt | offset
Field widths (bits): 6 | 5 | 5 | 16

Conditionally branch the number of instructions specified by the offset if register
rs equals rt.

Branch on not equal

bne rs, rt, label | 5 | rs | rt | offset
Field widths (bits): 6 | 5 | 5 | 16

Conditionally branch the number of instructions specified by the offset if register
rs is not equal to rt.

Jump

j target | 2 | target
Field widths (bits): 6 | 26

Unconditionally jump to the instruction at target.

Jump and link

jal target | 3 | target
Field widths (bits): 6 | 26

Unconditionally jump to the instruction at target. Save the address of the next
instruction in register $ra.

Jump and link register

Field widths (bits): 6 | 5 | 5 | 5 | 5 | 6

Unconditionally jump to the instruction whose address is in register rs. Save the
address of the next instruction in register rd (which defaults to 31).

Jump register

Field widths (bits): 6 | 5 | 15 | 6

Unconditionally jump to the instruction whose address is in register rs.
Compiling IF statement
- C code:
      if (i==j) f = g+h;
      else f = g-h;
  - f, g, ... in $s0, $s1, ...
- Compiled MIPS code:

            bne $s3, $s4, Else
            add $s0, $s1, $s2
            j   Exit
      Else: sub $s0, $s1, $s2
      Exit: ...

  Assembler calculates addresses

Show a sequence of MIPS instructions corresponding to:

      if (i<=j) x = x+1; z = 1; else y = y-1; z = 2*z

Solution

Similar to the "if-then" statement, but we need instructions for the
"else" part and a way of skipping the "else" part after the "then" part.

             slt  $t0,$s2,$s1       # j<i? (inverse condition)
             bne  $t0,$zero,else    # if j<i goto else part
             addi $t1,$t1,1         # begin then part: x = x+1
             addi $t3,$zero,1       # z = 1
             j    endif             # skip the else part
      else:  addi $t2,$t2,-1        # begin else part: y = y-1
             add  $t3,$t3,$t3       # z = z+z
      endif: ...
* Short-circuit evaluation for logical OR
  - If the first condition is true, the second condition is skipped

  if (($t1 > 0) || ($t2 < 0)) {$t3++;}

* Use fall-through to keep the code as short as possible

        bgtz  $t1, L1         # 1st condition true?
        bgez  $t2, next       # 2nd condition false?
L1:     addiu $t3, $t3, 1     # increment $t3
next:

Conditional Move Instructions

Instruction          Meaning
movz Rd, Rs, Rt      if (Rt == 0) Rd = Rs
movn Rd, Rs, Rt      if (Rt != 0) Rd = Rs

With branch and jump:            With conditional move:
      bne  $t0, $0, L1           addu $t1, $t2, $t3
      addu $t1, $t2, $t3         subu $t4, $t2, $t3
      j    L2                    movn $t1, $t4, $t0
L1:   subu $t1, $t2, $t3
L2:   ...

* Conditional move can eliminate branch & jump instructions
Compiling LOOP statement

- C code:
      while (save[i] == k) i += 1;

- MIPS code: (i in $s3, k in $s5, base address of save in $s6)

The first step is to load save[i] into a temporary register. Before we can load save[i] into
a temporary register, we need to have its address. Before we can add i to the base of array
save to form the address, we must multiply the index i by 4, because each element is a 4-byte word:

Loop: sll  $t1, $s3, 2       # $t1 = i*4
To get the address of save[i], we need to add $t1 and the base of save in $s6:
      add  $t1, $t1, $s6     # $t1 = address of save[i]
Now we can use that address to load save[i] into a temporary register:
      lw   $t0, 0($t1)       # $t0 = save[i]
The next instruction performs the loop test, exiting if save[i] ≠ k:
      bne  $t0, $s5, Exit    # Exit if save[i] != k
The next instruction adds 1 to i:
      addi $s3, $s3, 1       # i = i + 1
The end of the loop branches back to the while test at the top of the loop:
      j    Loop              # go to Loop
Exit:


Ex
while (A[i] == k) i = i + j;
Initially $s3, $s4, $s5 contain i, j, k respectively.
Let $s6 store the base of the array A. Each element of A
is a 32-bit word.

Loop: add $t1, $s3, $s3      # $t1 = 2*i
      add $t1, $t1, $t1      # $t1 = 4*i
      add $t1, $t1, $s6      # $t1 contains address of A[i]
      lw  $t0, 0($t1)        # $t0 contains A[i]
      bne $t0, $s5, Exit     # goto Exit if A[i] != k
      add $s3, $s3, $s4      # i = i + j
      j   Loop               # goto Loop
Exit: <next instruction>
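The address arithmetic in this loop can be followed with a Python sketch. The function run_loop and the dictionary standing in for word-addressed memory are assumptions made for illustration; the base address 0x1000 is arbitrary.

```python
def run_loop(memory, base, i, j, k):
    """Model of: while (A[i] == k) i = i + j; with 4-byte array elements."""
    while True:
        address = base + 4 * i       # the three add instructions: base + 4*i
        if memory[address] != k:     # lw followed by bne: leave when A[i] != k
            return i
        i += j                       # add $s3, $s3, $s4


# Hypothetical array A at base 0x1000 with A[0] = 7, A[2] = 7, A[4] = 3
memory = {0x1000: 7, 0x1008: 7, 0x1010: 3}
final_i = run_loop(memory, base=0x1000, i=0, j=2, k=7)
```

The multiplication by 4 is the step that is easy to forget: array indices count words, while MIPS addresses count bytes.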
Basic Blocks

A basic block is a sequence of instructions with
- No embedded branches (except at end)
- No branch targets (except at beginning)

A compiler identifies basic blocks for optimization

An advanced processor can accelerate execution of basic blocks

Supporting Procedures in Computer Hardware

* A function (or a procedure) is a block of instructions that can be called at
  several different points in the program
  - Allows the programmer to focus on just one task at a time
  - Allows code to be reused
  - Reduces duplication of code
* The function that initiates the call is known as the caller
* The function that receives the call is known as the callee
* When the callee finishes execution, control is transferred back to the caller
  function.
* A function can receive parameters and return results
* The function parameters and results act as an interface between a function
  and the rest of the program
* To execute a function, the caller does the following:
  - Puts the parameters in a place that can be accessed by the callee
  - Transfers control to the callee function
* To return from a function, the callee does the following:
  - Puts the results in a place that can be accessed by the caller
  - Returns control to the caller, next to where the function call was made


Caller:
- passes arguments to callee
- jumps to the callee

Callee:
- performs the procedure
- returns the result to caller
- returns to the point of call
- must not overwrite registers or memory needed by the caller

MIPS instructions for procedure call and return from procedure:

jal proc       # jump to loc proc and link;
               # "link" means "save the return
               # address" (PC)+4 in $ra ($31)
jr  rs         # go to loc addressed by rs

[Figure: the caller executes jal proc, which loads the PC with the address of the callee and saves the return address in $31 = $ra; the callee ends with jr $ra, which returns control to the caller.]

* "jump" and "return":
  - jal ProcAddr         # issued in the caller
    - jumps to ProcAddr
    - saves the return instruction address in $31
    - PC = JumpAddr, $31 = PC+4
  - jr $31 ($ra)         # last instruction in the callee
    - jumps back to the caller procedure
    - PC = $31
Steps required
1. Place parameters in registers
   - $a0 - $a3: four argument registers
2. Transfer control to procedure
3. Acquire storage for procedure
   - $t0 - $t9: temporaries, can be overwritten by callee
   - $s0 - $s7: saved, must be saved/restored by callee
4. Perform procedure's operations
5. Place result in register for caller
   - $v0 - $v1: two value registers for result values
6. Return to place of call


Register Usage

- $a0 - $a3: arguments (regs 4 - 7)
- $v0, $v1: result values (regs 2 and 3)
- $t0 - $t9: temporaries
  - Can be overwritten by callee
- $s0 - $s7: saved
- $gp: global pointer for static data (reg 28)
- $sp: stack pointer (reg 29)
- $fp: frame pointer (reg 30)
- $ra: return address (reg 31)

* MIPS procedure call instruction:

  jal ProcedureAddress     # jump and link

  - Saves PC+4 in register $ra to have a link to the
    next instruction for the procedure return
  - Machine format (J format): op = 3 | 26-bit address

* Procedure return with

  jr $ra     # return

  - Instruction format (R format): 0 | 31 | | | | 0x08
Procedure Call Summary

* Caller
  - Put arguments in $a0-$a3
  - Save any registers that are needed ($ra, maybe $t0-$t9)
  - jal callee
  - Restore registers
  - Look for result in $v0

* Callee
  - Save registers that might be disturbed ($s0-$s7)
  - Perform procedure
  - Put result in $v0
  - Restore registers
  - jr $ra

Non-Leaf Procedures

* Procedures that call other procedures
* For nested call, caller needs to save on the stack:
  - Its return address
  - Any arguments and temporaries needed after the call
* Restore from the stack after the call
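[Figure: relationship between the main program and a leaf procedure. main prepares to call and transfers control to the procedure; the procedure saves registers, executes, restores, and returns; main then prepares to continue.]

Nested Procedure Calls

[Figure: nested procedure calls. main prepares to call and calls a first procedure, which in turn prepares to call and calls a second procedure; each returns to its caller, which prepares to continue. Slide note: the text version of this figure is incorrect.]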
Leaf Procedure Example

C code:
int leaf_example (int g, h, i, j)
{ int f;
  f = (g + h) - (i + j);
  return f;
}
- Arguments g, ..., j in $a0, ..., $a3
- f in $s0 (hence, need to save $s0 on stack)
- Result in $v0

MIPS code:
leaf_example:
    addi $sp, $sp, -4
    sw   $s0, 0($sp)        # Save $s0 on stack
    add  $t0, $a0, $a1
    add  $t1, $a2, $a3      # Procedure body
    sub  $s0, $t0, $t1
    add  $v0, $s0, $zero    # Result
    lw   $s0, 0($sp)        # Restore $s0
    addi $sp, $sp, 4
    jr   $ra                # Return
Example
A leaf procedure is a procedure that doesn't call another procedure.
C code:

main()
{ int x,y,z;

  z = Avg1(x,y);

}
int Avg1(int g, h)
{ int f;
  f = (g + h)/2;
  return f;
}
- Assume x, y, z are in $s0, $s1, $s2
- f in $s0 (hence, need to save $s0 in the stack)
- Result in $v0

Main MIPS Code
Main:
      add $a0,$s0,$zero     # x in $a0 corresponds to g
      add $a1,$s1,$zero     # y in $a1 corresponds to h
      jal Avg1              # $ra = Nxt address, jump to Avg1
Nxt:  add $s2,$v0,$zero     # result in z

Leaf Procedure MIPS Code

Avg1:
      addi $sp, $sp, -4
      sw   $s0, 0($sp)        # Save $s0 on stack
      add  $t0, $a0, $a1      # sum
      srl  $s0, $t0, 1        # divide by 2
      add  $v0, $s0, $zero    # Save result
      lw   $s0, 0($sp)        # Restore $s0
      addi $sp, $sp, 4
      jr   $ra                # Return


Procedures

* Consider the following swap procedure (written in C)
* Translate this procedure to MIPS assembly language

void swap(int v[], int k)
{ int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}

Parameters:
$a0 = address of v[]
$a1 = k, and
the return address is in $ra

swap:
    sll $t0,$a1,2         # $t0 = k*4
    add $t0,$t0,$a0       # $t0 = v + k*4
    lw  $t1,0($t0)        # $t1 = v[k]
    lw  $t2,4($t0)        # $t2 = v[k+1]
    sw  $t2,0($t0)        # v[k] = $t2
    sw  $t1,4($t0)        # v[k+1] = $t1
    jr  $ra               # return
Non-Leaf Procedure Example

C code:

int fact (int n)
{
  if (n < 1) return 1;
  else return n * fact(n - 1);
}

* Argument n in $a0
* Result in $v0

MIPS code:

fact:
    addi $sp, $sp, -8       # adjust stack for 2 items
    sw   $ra, 4($sp)        # save return address
    sw   $a0, 0($sp)        # save argument
    slti $t0, $a0, 1        # test for n < 1
    beq  $t0, $zero, L1
    addi $v0, $zero, 1      # if so, result is 1
    addi $sp, $sp, 8        #   pop 2 items from stack
    jr   $ra                #   and return
L1: addi $a0, $a0, -1       # else decrement n
    jal  fact               # recursive call
    lw   $a0, 0($sp)        # restore original n
    lw   $ra, 4($sp)        #   and return address
    addi $sp, $sp, 8        # pop 2 items from stack
    mul  $v0, $a0, $v0      # multiply to get result
    jr   $ra                # and return
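The way the recursive calls stack up saved arguments and then unwind can be imitated in Python with an explicit list. This is a simplified sketch, not a translation of the MIPS code: the function name fact and the list named stack are illustrative, the saved $ra is omitted, and the frame that the base-case call pushes and immediately pops is not modelled.

```python
def fact(n):
    """Factorial with an explicit stack standing in for the frames saved through $sp."""
    stack = []                   # each entry models one saved argument ($a0)
    a0 = n
    while a0 >= 1:               # slti/beq: take the recursive path while n >= 1
        stack.append(a0)         # sw $a0, 0($sp)
        a0 -= 1                  # addi $a0, $a0, -1, then jal fact
    v0 = 1                       # base case: n < 1 returns 1
    max_depth = len(stack)       # deepest nesting reached
    while stack:                 # each return pops one frame
        a0 = stack.pop()         # lw $a0, 0($sp)
        v0 = a0 * v0             # mul $v0, $a0, $v0
    return v0, max_depth


value, depth = fact(5)
```

The depth grows linearly with n, which is why every non-leaf call has to save its own return address and argument before calling again.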
String Copy Example

C code
Null-terminated string

void strcpy (char x[], char y[])
{ int i;
  i = 0;
  while ((x[i]=y[i])!='\0')
    i += 1;
}

Addresses of x, y in $a0, $a1
i in $s0

MIPS code:

strcpy:
    addi $sp, $sp, -4         # adjust stack for 1 item
    sw   $s0, 0($sp)          # save $s0
    add  $s0, $zero, $zero    # i = 0
L1: add  $t1, $s0, $a1        # addr of y[i] in $t1
    lbu  $t2, 0($t1)          # $t2 = y[i]
    add  $t3, $s0, $a0        # addr of x[i] in $t3
    sb   $t2, 0($t3)          # x[i] = y[i]
    beq  $t2, $zero, L2       # exit loop if y[i] == 0
    addi $s0, $s0, 1          # i = i + 1
    j    L1                   # next iteration of loop
L2: lw   $s0, 0($sp)          # restore saved $s0
    addi $sp, $sp, 4          # pop 1 item from stack
    jr   $ra                  # and return
Copying a String

A string in C is an array of chars terminated with a null char

i = 0;
do { ch = source[i]; target[i] = ch; i++; }
while (ch != '\0');

Given that: $a0 = &target and $a1 = &source

loop:
    lb    $t0, 0($a1)       # load byte: $t0 = source[i]
    sb    $t0, 0($a0)       # store byte: target[i] = $t0
    addiu $a0, $a0, 1       # $a0 = &target[i]
    addiu $a1, $a1, 1       # $a1 = &source[i]
    bnez  $t0, loop         # loop until NULL char




Example of a Loop Structure

for (i=1000; i>0; i--)
    x[i] = x[i] + h;

Assume: addresses of x[1000] and x[0] are in $s1 and $s5 respectively; h is in $s2.

Loop: lw   $s0, 0($s1)        # $s1 -> x[1000]
      add  $s3, $s0, $s2      # $s2 = h
      sw   $s3, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $s5, Loop     # $s5 -> x[0]
MIPS Memory Layout

The top addresses from 0x80000000 to 0xFFFFFFFF are not available to user programs. They
are used for the operating system and for ROM. When a MIPS chip is used in an embedded
controller, the control program exists in ROM in this upper half of the address space.

The parts of the address space accessible to a user program are divided as follows:

Reserved
Memory below the text segment and above the stack is reserved for use by the operating
system.

Text
The assembly language instructions begin at address 0x400000.

Data (Static + Dynamic)
This holds the data that the program operates on. Static data stores global variables (e.g., static
variables in C, constant arrays and strings). $gp is reserved to point to static data. Dynamic
data is data that is allocated and deallocated as the program executes; dynamic data storage
grows toward higher memory locations.

Stack
For function and procedure linkage. $sp is reserved to point to the stack segment. The $sp is
initialized to (7FFF FFFC)hex.

[Figure: MIPS memory map from high to low addresses: memory-mapped I/O, kernel data, and kernel text (0x80000000 to 0xFFFFFFFF); stack segment (top at 0x7FFFFFFF); data segment with dynamic data above static data (starting at 0x10000000); text segment; reserved (from 0x00000000).]
Hex address    Region
00000000       Reserved (1 M words)
00400000       Text segment (63 M words); $pc is the pointer to the next instruction
10000000       Data segment: static data, then dynamic data
10008000       $28 ($gp), pointer into global data; 10000000 to 1000ffff is addressable with a 16-bit signed offset
1000ffff       End of the region addressable with a 16-bit signed offset
7ffffffc       Stack segment; $29 ($sp) points to the last word allocated on the stack; $30 is $fp
80000000       Second half of address space reserved for memory-mapped I/O

The data and stack segments together occupy 448 M words.
Where is the stack located?

[Figure: memory from lower to higher addresses: reserved, instruction segment, data segment, stack segment; $sp points to the top of the stack.]
* Use of the Stack in procedure calls
* The stack
  - A dedicated area (part) of the main memory.
  - LIFO
  - Holds values passed to a procedure as arguments
  - Saves register contents when needed
  - Provides space for variables local to a procedure
  - Stack operations:
    - sw: place (push) data on stack
    - lw: remove (pop) data from stack
  - In MIPS, it grows from high address to low address as you
    push data on the stack.
  - Consequently, the content of the stack pointer ($sp) decreases.
  - $29 ($sp) stores the address of the top of stack
Using the Stack for Data Storage

Spilling Registers

What if registers for argument and return values
are not enough?
- Use the stack

* $sp ($29) is used as the stack pointer
  - Push
      $sp = $sp - 4
      copy data to stack at new $sp
  - Pop
      get data from stack at $sp
      $sp = $sp + 4

[Figure: the stack between high addr and low addr, with $sp pointing to the top of the stack.]

* To push elements onto the stack:
  - Move the stack pointer $sp down to make room for the new data.
  - Store the elements into the stack.

* For example, to push registers $t1 and $t2 onto the stack:

    addi $sp, $sp, -8
    sw   $t1, 4($sp)
    sw   $t2, 0($sp)

* An equivalent sequence is:

    sw   $t1, -4($sp)
    sw   $t2, -8($sp)
    addi $sp, $sp, -8

[Figure: the stack before and after the push; afterwards $t1 and $t2 occupy the two words below the old top of stack, and $sp points to the word holding $t2.]
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How Procedures use the Stack 
But diffofsums overwrites 3 registers: $t0, $t1, $s0

# MIPS assembly
# $s0 = result
diffofsums:
    add $t0, $a0, $a1   # $t0 = f + g
    add $t1, $a2, $a3   # $t1 = h + i
    sub $s0, $t0, $t1   # result = (f + g) - (h + i)
    add $v0, $s0, $0    # put return value in $v0
    jr  $ra             # return to caller





Storing Register Values on the Stack

# $s0 = result
diffofsums:
    addi $sp, $sp, -12  # make space on stack to store 3 registers
    sw   $s0, 8($sp)    # save $s0 on stack
    sw   $t0, 4($sp)    # save $t0 on stack
    sw   $t1, 0($sp)    # save $t1 on stack
    add  $t0, $a0, $a1  # $t0 = f + g
    add  $t1, $a2, $a3  # $t1 = h + i
    sub  $s0, $t0, $t1  # result = (f + g) - (h + i)
    add  $v0, $s0, $0   # put return value in $v0
    lw   $t1, 0($sp)    # restore $t1 from stack
    lw   $t0, 4($sp)    # restore $t0 from stack
    lw   $s0, 8($sp)    # restore $s0 from stack
    addi $sp, $sp, 12   # deallocate stack space
    jr   $ra            # return to caller
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[Simulator screenshot: pushing $25 onto the stack, with the assembled machine code.]

    Assembly                 MIPS Machine Code
    lui  $29, 0xabcd         3c1dabcd
    ori  $29, $29, 0x4320    37bd4320
    lui  $25, 0x9988         3c199988
    ori  $25, $25, 0x7766    37397766
    sw   $25, -4($29)        afb9fffc
    addi $29, $29, -4        23bdfffc

Register values shown after execution: $25 = 0x99887766, $29 = 0xabcd431c. The memory view shows address 0xabcd431c.
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Stack Frames 


* The stack segment is used by functions for:
  + Passing parameters that cannot fit in registers
  + Allocating space for local variables
  + Saving registers across function calls
  + Implementing recursive functions

* The stack segment is implemented via software:
  + The Stack Pointer $sp = $29 (points to the top of stack)
  + The Frame Pointer $fp = $30 (points to a stack frame)

* A stack frame is an area of the stack containing:
  + Saved arguments, registers, local arrays and variables (if any)
* Also called the activation frame or activation record
* Frames are pushed and popped by adjusting $sp:
  + Decrement $sp to allocate a stack frame, and increment it to free the frame
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Allocating Space on the Stack 





The stack frame is the segment of the stack containing a procedure's saved registers and local variables.

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The frame pointer ($fp) points to the first word of the frame of a procedure.
- It provides a stable "base" register for the procedure.
- $fp is initialized using $sp on a call, and $sp is restored using $fp on a return.

[Figure: the stack (a) before, (b) during, and (c) after the procedure call, drawn from high address (top) to low address (bottom), with $fp and $sp marked in each panel. During the call, the frame between $fp and $sp holds, from the top: saved argument registers (if any), saved return address, saved saved registers (if any), and local arrays and structures (if any).]
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MIPS Addressing Mode 


The MIPS addressing modes are the following:

1. Immediate addressing, where the operand is a constant within the instruction itself.
2. Register addressing, where the operand is a register.
3. Base or displacement addressing, where the operand is at the memory location whose address is the sum of a register and a constant in the instruction.
4. PC-relative addressing, where the branch address is the sum of the PC and a constant in the instruction.
5. Pseudodirect addressing, where the jump address is the 26 bits of the instruction concatenated with the upper bits of the PC.
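As a rough illustration of modes 3 to 5, the address arithmetic can be written out in Python. The function names below are hypothetical. The sketch also assumes the usual MIPS detail that branch offsets count words relative to PC + 4 and that a jump keeps the upper 4 bits of PC + 4, which is consistent with the "PC + 4 + 100" entry for an offset of 25 in the instruction summary.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def base_address(reg_value, imm16):
    """Base (displacement) addressing: register + sign-extended constant."""
    return (reg_value + sign_extend16(imm16)) & 0xFFFFFFFF


def branch_target(pc, imm16):
    """PC-relative addressing: (PC + 4) + word offset * 4."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def jump_target(pc, addr26):
    """Pseudodirect addressing: upper 4 bits of PC + 4, then addr26, then 00."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x03FFFFFF) << 2)
```

For example, `jump_target(0x0040005C, 0x0100028)` evaluates to `0x004000A0`, and `branch_target(pc, 25)` lands 100 bytes past `pc + 4`.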


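[Figure: instruction fields and operand location for each addressing mode: 1. Immediate addressing (op, rs, rt, immediate), 2. Register addressing (operand in a register), 3. Base addressing and 4. PC-relative addressing (address formed by an adder, operand or target in memory), 5. Pseudodirect addressing.]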


Dr. Ahmed Jaber Spring 2019 


Register Only Addressing
* Operands found in registers
* Example: add $s0, $t2, $t3
* Example: sub $t8, $s1, $0

Immediate Addressing
* 16-bit immediate used as an operand
* Example: addi $s4, $t5, -73
* Example: ori  $t3, $t7, 0xFF

Base Addressing
* Address of operand is: base address + sign-extended immediate
* Example: lw $s4, 72($0)      Address = $0 + 72
* Example: sw $t2, -25($t1)    Address = $t1 - 25


PC-Relative Addressing

          beq  $t0, $0, else
          addi $v0, $0, 1
          addi $sp, $sp, i
          jr   $ra
    else: addi $a0, $a0, -1
          jal  factorial





Pseudo-direct Addressing

    0x0040005C        jal sum
    ...
    0x004000A0  sum:  add $v0, $a0, $a1

JTA          0000 0000 0100 0000 0000 0000 1010 0000   (0x004000A0)
26-bit addr  0000 0100 0000 0000 0000 1010 00          (0x0100028)

Field values: op = 3 (6 bits), addr = 0x0100028 (26 bits)
Machine code: 000011 00 0001 0000 0000 0000 0010 1000 (0x0C100028)
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What is data?

* Numbers: binary encoding
* Characters: ASCII, Unicode
* Strings: sequences of characters
* Audio
  - 1-D array (time) of sound pressure
* Images
* What else? ...
* Programs!

ASCII Characters

ASCII (American Standard Code for Information Interchange)

[Table: ASCII code chart indexed by hex digits, with the columns 8-9 labelled "More controls" and the columns a-f labelled "More symbols". The 8-bit ASCII code is read as (col #, row #); e.g., the code for + is (2b) hex or (0010 1011) two.]
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Decoding Machine Code 


Decoding: reverse-engineer machine language to recover the assembly language.
Example: 00af8020 hex
1. Convert hexadecimal to binary:
   0000 0000 1010 1111 1000 0000 0010 0000
2. Look at the op field to determine the operation. The op field is 000000, so it is an R-type instruction.
3. Decode the rest of the instruction by looking at the field values:
   op      rs     rt     rd     shamt  funct
   000000  00101  01111  10000  00000  100000
4. Reveal the assembly instruction:
   add $s0, $a1, $t7
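The decoding steps can be mirrored in a short Python sketch. It is a minimal illustration that only handles a handful of R-type funct codes; the names `REG_NAMES`, `R_FUNCT`, and `decode_r_type` are made up for this example.

```python
# Conventional MIPS register names, indexed by register number.
REG_NAMES = (
    "$zero $at $v0 $v1 $a0 $a1 $a2 $a3 "
    "$t0 $t1 $t2 $t3 $t4 $t5 $t6 $t7 "
    "$s0 $s1 $s2 $s3 $s4 $s5 $s6 $s7 "
    "$t8 $t9 $k0 $k1 $gp $sp $fp $ra"
).split()

# A few R-type funct codes (decimal), enough for the examples in these notes.
R_FUNCT = {32: "add", 34: "sub", 36: "and", 37: "or", 39: "nor", 42: "slt"}


def decode_r_type(word):
    """Split a 32-bit word into R-format fields and print-style assembly."""
    op = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    rd = (word >> 11) & 0x1F
    funct = word & 0x3F
    if op != 0 or funct not in R_FUNCT:
        raise ValueError("not one of the R-type instructions handled here")
    return f"{R_FUNCT[funct]} {REG_NAMES[rd]}, {REG_NAMES[rs]}, {REG_NAMES[rt]}"
```

Calling `decode_r_type(0x00AF8020)` reproduces the result of the worked example.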


All MIPS instructions are 32 bits long.

Field size | 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits | Comments
R-format   | op     | rs     | rt     | rd     | shamt  | funct  | Arithmetic instruction format
I-format   | op     | rs     | rt     | address/immediate (16 bits) | Transfer, branch, imm. format
J-format   | op     | target address (26 bits) | Jump instruction format

Assembly Code          Machine Code
lw   $t2, 32($0)       0x8C0A0020
add  $s0, $s1, $s2     0x02328020
addi $t0, $s3, -12     0x2268FFF4
sub  $t0, $t3, $t5     0x016D4022

Stored Program

Address    Instructions
0040000C   016D4022
00400008   2268FFF4
00400004   02328020
00400000   8C0A0020   <- PC
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[Figure: the three MIPS instruction formats. R-format: op (6 bits), rs (5 bits), rt (5 bits), rd (5 bits), shamt (5 bits), funct (6 bits). I-format: op (6 bits), rs (5 bits), rt (5 bits), immediate (16 bits). J-format: op (6 bits), jump target address (26 bits).]

[Table: instruction classes with their op codes. Legible entries: Copy: load upper immediate (op 15). Arithmetic: add, subtract, set less than (op 0), add immediate (op 8), set less than immediate (op 10). Logic: AND immediate, andi (op 12), OR immediate, ori (op 13), XOR immediate (op 14). Memory access: load word, store word. Control transfer.]


2.16) Provide the type, assembly language instruction, and binary representation of the instruction described by the following MIPS fields:
op=0 rs=3 rt=2 rd=3 shamt=0 funct=34
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Step 1 of 3

Consider the MIPS fields of the instruction:
op=0, rs=3, rt=2, rd=3, shamt=0, funct=34

Based on the MIPS instruction encoding (refer to FIGURE 2.5 in the textbook):
* The opcode (op) value is 0.
* The funct field selects the variant of the operation (32 = add, 34 = subtract). Here the funct field is 34, so the instruction is sub (subtract) and the instruction format is R-type.

So, the MIPS fields describe an R-type instruction.

Step 2 of 3

Based on the MIPS register conventions table (refer to FIGURE 2.14 in the textbook):
* rs=3 is register $v1
* rt=2 is register $v0
* rd=3 is register $v1
* The funct field is 34, so the instruction is sub (subtract).

So, the assembly language instruction is "sub $v1, $v1, $v0".

Step 3 of 3

The fields of the R-type instruction format:

    op      rs      rt      rd      shamt   funct
    6 bits  5 bits  5 bits  5 bits  5 bits  6 bits

Convert the decimal values of the MIPS fields into binary:
* op    = 0  = 000000
* rs    = 3  = 00011
* rt    = 2  = 00010
* rd    = 3  = 00011
* shamt = 0  = 00000
* funct = 34 = 100010

After filling the values into the R-type instruction format:

    op      rs     rt     rd     shamt  funct
    000000  00011  00010  00011  00000  100010

Therefore, the binary representation of the instruction is:

000000 00011 00010 00011 00000 100010
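The same field packing can be checked mechanically. The sketch below is an illustration only; `encode_r_type` is a hypothetical helper, not something defined by the exercise or the textbook.

```python
def encode_r_type(rs, rt, rd, shamt, funct, op=0):
    """Pack R-format fields (6, 5, 5, 5, 5, 6 bits) into one 32-bit word."""
    for value, width in ((op, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)):
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit in its bit width")
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


word = encode_r_type(rs=3, rt=2, rd=3, shamt=0, funct=34)   # sub $v1, $v1, $v0
bits = format(word, "032b")
# Split the 32-bit string at the R-format field boundaries.
fields = " ".join(
    [bits[0:6], bits[6:11], bits[11:16], bits[16:21], bits[21:26], bits[26:32]]
)
```

Here `fields` holds the six binary groups in the same order as the table above, and `word` is the instruction as a single integer.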
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Translation and Startup 


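[Figure: translation and startup. An assembly language program is translated into an object (machine language module), which is combined with object library routines (machine language) through static linking and then placed in memory. Many compilers produce object modules directly.]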


The linker has the following responsibilities: 


> Ensuring correct interpretation (resolution) of labels in all modules 
> Determining the placement of text and data segments in memory 


> Evaluating all data addresses and instruction labels 


The loader is in charge of the following: 


> Determining the memory needs of the program from its header 

> Copying text and data from the executable program file into memory 
> Modifying (shifting) addresses, where needed, during copying 

> Placing program parameters onto the stack (as in a procedure call) 
> Initializing all machine registers, including the stack pointer 


> Jumping to a start-up routine that calls the program's main routine 
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Program Template 


# Title:
# Author:
# Description:
# Input:
# Output:
################# Data segment #####################
.data

################# Code segment #####################
.text
.globl main
main:                   # main program entry

    li $v0, 10          # Exit program
    syscall

* .DATA directive
  + Defines the data segment of a program containing data
  + The program's variables should be defined under this directive
  + Assembler will allocate and initialize the storage of variables
* .TEXT directive
  + Defines the code segment of a program containing instructions
* .GLOBL directive
  + Declares a symbol as global
  + Global symbols can be referenced from other files
  + We use this directive to declare the main procedure of a program
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Data Directives 


* .BYTE Directive
  + Stores the list of values as 8-bit bytes
* .HALF Directive
  + Stores the list as 16-bit values aligned on a half-word boundary
* .WORD Directive
  + Stores the list as 32-bit values aligned on a word boundary
* .WORD w:n Directive
  + Stores the 32-bit value w into n consecutive words aligned on a word boundary
* .HALF w:n Directive
  + Stores the 16-bit value w into n consecutive half-words aligned on a half-word boundary
* .BYTE w:n Directive
  + Stores the 8-bit value w into n consecutive bytes
* .FLOAT Directive
  + Stores the listed values as single-precision floating point
* .DOUBLE Directive
  + Stores the listed values as double-precision floating point
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String Directives 

* .ASCII Directive
  + Allocates a sequence of bytes for an ASCII string
* .ASCIIZ Directive
  + Same as the .ASCII directive, but adds a NULL char at the end of the string
  + Strings are null-terminated, as in the C programming language
* .SPACE n Directive
  + Allocates space of n uninitialized bytes in the data segment
* Special characters in strings follow the C convention
  + Newline: \n   Tab: \t   Quote: \"

Examples of Data Definitions

    .BYTE   'A', 'E', 127, -1, '\n'
    .HALF   -10, 0xffff
    .WORD   0x12345678
    .WORD   0:10
    .FLOAT  12.3, -0.1
    .DOUBLE 1.5e-10
    .ASCII  "A String\n"
    .SPACE  100
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Memory Alignment 
* Memory is viewed as an array of bytes with addresses
  + Byte addressing: an address points to a byte in memory
* Words occupy 4 consecutive bytes in memory
  + MIPS instructions and integers occupy 4 bytes
* Alignment: address is a multiple of size
  + Word address should be a multiple of 4
    Least significant 2 bits of address should be 00
  + Halfword address should be a multiple of 2
* .ALIGN n directive
  + Aligns the next data definition on a 2^n byte boundary

Symbol Table

* Assembler builds a symbol table for labels (variables)
  + Assembler computes the address of each label in the data segment
* Example:

    var1: .BYTE   1, 2, 'Z'
    str1: .ASCIIZ "My String\n"
    var2: .WORD   0x12345678
          .ALIGN  3
    var3: .HALF   1000

  Symbol Table

    Label   Address
    var1    0x10010000
    str1    0x10010003
    var2    0x10010010
    var3    0x10010018

[Figure: data segment contents. From 0x10010000: var1, then str1 ending in a 0 byte, then unused bytes. From 0x10010010: var2 (0x12345678, aligned), unused bytes, then var3 at an address that is a multiple of 8.]
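The addresses in the example symbol table can be reproduced with a small sketch of the alignment rule. The helpers `align_up` and `layout` are hypothetical names, and the sizes passed in are assumptions read off the example (three bytes for var1, ten characters plus NULL for str1).

```python
def align_up(address, boundary):
    """Round address up to the next multiple of boundary (a power of two)."""
    return (address + boundary - 1) & ~(boundary - 1)


def layout(items, start=0x10010000):
    """items: (label, size_in_bytes, alignment). Returns {label: address}."""
    table, address = {}, start
    for label, size, alignment in items:
        address = align_up(address, alignment)   # skip unused padding bytes
        table[label] = address
        address += size
    return table


symbols = layout([
    ("var1", 3, 1),    # .BYTE with three values
    ("str1", 11, 1),   # .ASCIIZ "My String\n": 10 characters + NULL
    ("var2", 4, 4),    # .WORD is aligned on a word boundary
    ("var3", 2, 8),    # .HALF placed after .ALIGN 3 (2**3 = 8)
])
```

The padding skipped by `align_up` corresponds to the "unused" bytes in the memory picture.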
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Summary of MIPS Instructions 


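Category | Instruction | Example | Meaning | Comments
Arithmetic | add | add $s1,$s2,$s3 | $s1 = $s2 + $s3 |
Arithmetic | subtract | sub $s1,$s2,$s3 | $s1 = $s2 - $s3 |
Arithmetic | add immediate | addi $s1,$s2,20 | $s1 = $s2 + 20 | Used to add constants
Data transfer | store word | sw $s1,20($s2) | Memory[$s2 + 20] = $s1 | Word from register to memory
Data transfer | load half unsigned | lhu $s1,20($s2) | $s1 = Memory[$s2 + 20] |
Data transfer | store half | sh $s1,20($s2) | Memory[$s2 + 20] = $s1 | Halfword register to memory
Data transfer | load byte unsigned | lbu $s1,20($s2) | $s1 = Memory[$s2 + 20] |
Data transfer | store byte | sb $s1,20($s2) | Memory[$s2 + 20] = $s1 |
Data transfer | store conditional word | sc $s1,20($s2) | Memory[$s2 + 20] = $s1; $s1 = 0 or 1 | Store word as 2nd half of atomic swap
Data transfer | load upper immediate | lui $s1,20 | $s1 = 20 * 2^16 | Loads constant in upper 16 bits
Logical | and | and $s1,$s2,$s3 | $s1 = $s2 & $s3 | Three reg. operands; bit-by-bit AND
Logical | or | or $s1,$s2,$s3 | $s1 = $s2 | $s3 | Three reg. operands; bit-by-bit OR
Logical | nor | nor $s1,$s2,$s3 | $s1 = ~($s2 | $s3) | Three reg. operands; bit-by-bit NOR
Logical | and immediate | andi $s1,$s2,20 | $s1 = $s2 & 20 | Bit-by-bit AND reg with constant
Logical | or immediate | ori $s1,$s2,20 | $s1 = $s2 | 20 | Bit-by-bit OR reg with constant
Logical | shift left logical | sll $s1,$s2,10 | $s1 = $s2 << 10 |
Logical | shift right logical | srl $s1,$s2,10 | $s1 = $s2 >> 10 |
Conditional branch | branch on equal | beq $s1,$s2,25 | if ($s1 == $s2) go to PC + 4 + 100 | Equal test; PC-relative branch
Conditional branch | branch on not equal | bne $s1,$s2,25 | if ($s1 != $s2) go to PC + 4 + 100 | Not equal test; PC-relative
Conditional branch | set on less than | slt $s1,$s2,$s3 | if ($s2 < $s3) $s1 = 1; else $s1 = 0 |
Conditional branch | set on less than unsigned | sltu $s1,$s2,$s3 | if ($s2 < $s3) $s1 = 1; else $s1 = 0 | Compare less than unsigned
Conditional branch | set less than immediate | slti $s1,$s2,20 | if ($s2 < 20) $s1 = 1; else $s1 = 0 | Compare less than constant
Conditional branch | set less than immediate unsigned | sltiu $s1,$s2,20 | if ($s2 < 20) $s1 = 1; else $s1 = 0 | Compare less than constant unsigned
Unconditional jump | jump | j 2500 | go to 10000 | Jump to target address
Unconditional jump | jump register | jr $ra | go to $ra | For switch, procedure return
Unconditional jump | jump and link | jal 2500 | $ra = PC + 4; go to 10000 | For procedure call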
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MIPS Organization Summary 







[Figure: MIPS organization. The processor (register file, ALU, PC, branch offset adder) connects to memory through a 32-bit read/write address and 32-bit read data and write data paths. Memory holds 2^30 words of 32 bits each, at byte addresses 0...0000, 0...0100, 0...1000, 0...1100, up to 1...1100.]


MIPS (RISC) Design Principles
* Simplicity favors regularity
  - fixed-size instructions: 32 bits
  - small number of instruction formats
  - opcode always the first 6 bits
* Good design demands good compromises
  - 3 basic instruction formats
* Smaller is faster
  - limited instruction set
  - limited number (32) of registers in register file
  - limited number (5) of addressing modes
* Make the common case fast
  - arithmetic operands from the register file (load-store machine)
  - allow instructions to contain immediate operands
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MIPS Instruction Implementation Types 


instruction Coding 
Instruction Type 


Non-Jump R-Type 








Non-Register Jump 


Jump Register 


The ALU is not used. 


Homework #3 


* Exercises in the textbook (Computer Organization & Design, by Patterson & Hennessy, 5th Edition):
  2.1, 2.3, 2.4, 2.6, 2.7, 2.10, 2.16, 2.23, 2.27, and 2.38.
* LAB (MIPSim2):
  2.12 and 2.19
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Chapter 3 
Arithmetic for Computers
* Operations on integers
  - Addition and subtraction
  - Multiplication and division
  - Dealing with overflow
* Floating-point real numbers
  - Representation and operations

Fixed-radix positional representation with k digits

Value of a number: \( x = (x_{k-1} x_{k-2} \ldots x_1 x_0)_r = \sum_{i=0}^{k-1} x_i r^i \)

For example:
\( 27 = (11011)_2 = (1 \times 2^4) + (1 \times 2^3) + (0 \times 2^2) + (1 \times 2^1) + (1 \times 2^0) \)
\( 123 = (1 \times 10^2) + (2 \times 10^1) + (3 \times 10^0) \)

Representation Range and Overflow

[Figure: number line showing the finite set of representable numbers between max- and max+. Numbers smaller than max- fall in the overflow region on the left; numbers larger than max+ fall in the overflow region on the right.]
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Note:- 


- An unsigned integer containing n bits can have a value between 0 and 2^n - 1 (which is 2^n different values).

- If a signed integer has n bits, it can contain a number between -2^(n-1) and +(2^(n-1) - 1).


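Overflow Detection Logic
* For an N-bit ALU: Overflow = CarryIn[N - 1] XOR CarryOut[N - 1]

[Figure: 1-bit ALUs chained through their carries, starting with inputs A0, B0, and CarryIn0. An XOR gate on the CarryIn and CarryOut of the last stage produces Overflow.]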
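The XOR rule can be tried out on small word sizes. The following is a hedged sketch: `add_overflow` is a hypothetical helper that recovers the carry into and out of the most significant bit from ordinary integer addition, rather than a model of the gate-level circuit.

```python
def add_overflow(a, b, n=32):
    """Add two n-bit patterns; return (n-bit result, overflow flag).

    overflow = carry into the most significant bit XOR carry out of it.
    """
    mask = (1 << n) - 1
    low_mask = mask >> 1                       # all bits below the MSB
    carry_in_msb = ((a & low_mask) + (b & low_mask)) >> (n - 1)
    total = (a & mask) + (b & mask)
    carry_out_msb = total >> n
    return total & mask, carry_in_msb ^ carry_out_msb
```

With 4-bit operands, `add_overflow(0b0111, 0b0010, n=4)` flags an overflow because 7 + 2 exceeds the largest 4-bit signed value, while `add_overflow(0b0011, 0b1110, n=4)` (3 plus -2) does not.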
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Integer Addition 


* Example: 7 + 2

    +7:  0000 0000 ... 0000 0111
    +2:  0000 0000 ... 0000 0010

* Overflow if result out of range
  - Adding +ve and -ve operands: no overflow
  - Adding two +ve operands: overflow if result sign is 1
  - Adding two -ve operands: overflow if result sign is 0

Integer Subtraction

* Add negation of second operand
* Example: 7 - 6 = 7 + (-6)

    +7:  0000 0000 ... 0000 0111
    -6:  1111 1111 ... 1111 1010
    +1:  0000 0000 ... 0000 0001

* Overflow if result out of range
  - Subtracting two +ve or two -ve operands: no overflow
  - Subtracting +ve from -ve operand: overflow if result sign is 0
  - Subtracting -ve from +ve operand: overflow if result sign is 1
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Dealing with Overflow 


* The computer designer must therefore provide a way to ignore overflow in some cases and to recognize it in others. The MIPS solution is to have two kinds of arithmetic instructions to recognize the two choices:
  - Add (add), add immediate (addi), and subtract (sub) cause exceptions (interrupts) on overflow.
  - Add unsigned (addu), add immediate unsigned (addiu), and subtract unsigned (subu) do not cause exceptions (interrupts) on overflow.
* Some languages (e.g., C, Java) ignore overflow
  - The MIPS C compilers use the addu, addiu, and subu instructions
* Other languages (e.g., Ada, Fortran) require raising an exception
  - The MIPS Fortran compilers use the add, addi, and sub instructions
  - On overflow, invoke the exception handler
* Exception: also called interrupt on many computers. An unscheduled event that disrupts program execution; used to detect overflow. Interrupt: an exception that comes from outside of the processor.
  - Save the PC in the exception program counter (EPC) register.
  - Jump to a predefined handler address.
  - The mfc0 (move from coprocessor reg) instruction is used to copy the EPC into a general-purpose register so that MIPS software has the option of returning to the offending instruction via a jump register instruction.
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What about Performance: 





Full-Adder (FA)

* Examine the full adder table:

    x  y  Cin | Cout  S
    0  0  0   |  0    0
    0  0  1   |  0    1
    0  1  0   |  0    1
    0  1  1   |  1    0
    1  0  0   |  0    1
    1  0  1   |  1    0
    1  1  0   |  1    0
    1  1  1   |  1    1

    Cout = x·y + Cin·(x + y)
    S = x'·y'·c + x'·y·c' + x·y'·c' + x·y·c = x XOR y XOR c

* In general, for bit i:
    c_{i+1} = x_i·y_i + c_i·(x_i + y_i), where c_{i+1} = Cout and c_i = Cin

Critical path of an n-bit ripple-carry adder is n*CP.

[Figure: 1-bit ALUs chained in a ripple-carry arrangement, each taking A_i, B_i, and C_i:
    Sum_i = A_i XOR B_i XOR C_i
    C_{i+1} = (A_i + B_i)·C_i + A_i·B_i ]
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[Figure: ripple-carry adder built from full adders (FA). Inputs x31 y31 ... x1 y1 x0 y0, outputs s31 ... s1 s0, with Cin entering the rightmost FA and Cout leaving the leftmost; the critical path runs along the carry chain.]

c_{i+1} = (b_i · c_i) + (a_i · c_i) + (a_i · b_i)
        = (a_i · b_i) + (a_i + b_i) · c_i

The Disadvantage of Ripple Carry

* The adder we just built is called a "Ripple Carry Adder"
* The carry bit may have to propagate from LSB to MSB
* Worst-case delay for an N-bit adder: 2N gate delays

[Figure: four 1-bit ALUs in series, from CarryIn through Result0, Result1, Result2, and Result3 to CarryOut.]
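To make the carry chain concrete, here is a small Python sketch of a ripple-carry adder built from the full-adder equations. The function names and the LSB-first bit lists are choices made for this illustration only.

```python
def full_adder(x, y, cin):
    """One full adder: S = x XOR y XOR cin, Cout = x·y + cin·(x + y)."""
    s = x ^ y ^ cin
    cout = (x & y) | (cin & (x | y))
    return s, cout


def ripple_carry_add(a_bits, b_bits, carry_in=0):
    """a_bits and b_bits list bit 0 (LSB) first; returns (sum_bits, carry_out)."""
    carry, sum_bits = carry_in, []
    for x, y in zip(a_bits, b_bits):
        # Each stage must wait for the carry produced by the previous stage.
        s, carry = full_adder(x, y, carry)
        sum_bits.append(s)
    return sum_bits, carry
```

The loop makes the disadvantage visible: stage i cannot produce its outputs until stage i - 1 has produced its carry, so the delay grows with the number of bits.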
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Carry Look Ahead (Design trick: peek) 


    A  B | C-out
    0  0 | 0      "kill"
    0  1 | C-in   "propagate"
    1  0 | C-in   "propagate"
    1  1 | 1      "generate"

    g = A and B
    p = A or B

For a 4-bit block:
    P = p0·p1·p2·p3
    G = g3 + p3·g2 + p3·p2·g1 + p3·p2·p1·g0
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g_i = a_i · b_i
p_i = a_i + b_i

Using them to define c_{i+1}, we get

    c_{i+1} = g_i + p_i · c_i

To see where the signals get their names, suppose g_i is 1. Then

    c_{i+1} = g_i + p_i · c_i = 1 + p_i · c_i = 1

That is, the adder generates a CarryOut (c_{i+1}) independent of the value of CarryIn (c_i). Now suppose that g_i is 0 and p_i is 1. Then

    c_{i+1} = g_i + p_i · c_i = 0 + 1 · c_i = c_i

That is, the adder propagates CarryIn to a CarryOut. Putting the two together, CarryIn_{i+1} is a 1 if either g_i is 1 or both p_i is 1 and CarryIn_i is 1.

    c1 = g0 + (p0 · c0)
    c2 = g1 + (p1 · g0) + (p1 · p0 · c0)
    c3 = g2 + (p2 · g1) + (p2 · p1 · g0) + (p2 · p1 · p0 · c0)
    c4 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)
         + (p3 · p2 · p1 · p0 · c0)
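A quick way to gain confidence in the four carry equations is to code them directly and compare against ordinary addition. The sketch below does that for a 4-bit adder; `cla4` is a hypothetical name and the integer-based interface is an assumption made for brevity.

```python
def cla4(a, b, c0=0):
    """4-bit carry-lookahead add of a and b (0..15). Returns (sum, c4)."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]   # generate
    p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(4)]   # propagate
    # Every carry is written in terms of g, p, and c0 only (no rippling).
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (((a >> i) & 1) ^ ((b >> i) & 1) ^ carries[i]) << i
    return total, c4
```

Because each carry depends only on the g and p signals and c0, all four can be computed in parallel in hardware, which is the point of the lookahead scheme.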
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A plumbing analogy for carry lookahead for 1 bit, 2 bits, and 4 bits 
using water pipes and valves.

    c1 = g0 + (p0 · c0)
    c2 = g1 + (p1 · g0) + (p1 · p0 · c0)
    c3 = g2 + (p2 · g1) + (p2 · p1 · g0) + (p2 · p1 · p0 · c0)
    c4 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)
         + (p3 · p2 · p1 · p0 · c0)
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Group Carry Look-ahead (16-bit): Abstraction 


Four 4-bit ALUs using carry lookahead to form а 16-bit adder. Note that the carries 
come from the carry-lookahead unit, not from the 4-bit ALUs. 


[Figure: four 4-bit ALUs (ALU0 to ALU3) producing Result0-3, Result4-7, Result8-11, and Result12-15, with CarryIn entering the carry-lookahead unit and CarryOut leaving it.]

For the first 4-bit ALU (ALU0):
    P0 = p0·p1·p2·p3
    G0 = g3 + p3·g2 + p3·p2·g1 + p3·p2·p1·g0
    Cout = G0 + P0·CarryIn
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Group Carry-Lookahead

c4 = g3 + p3·g2 + p3·p2·g1 + p3·p2·p1·g0 + p3·p2·p1·p0·c0

Approach: use carry lookahead for 4-bit groups

* "Super Propagate" equations:
    P0 = p3 · p2 · p1 · p0
    P1 = p7 · p6 · p5 · p4
    P2 = p11 · p10 · p9 · p8
    P3 = p15 · p14 · p13 · p12
* "Super Generate" equations:
    G0 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)
    G1 = g7 + (p7 · g6) + (p7 · p6 · g5) + (p7 · p6 · p5 · g4)
    G2 = g11 + (p11 · g10) + (p11 · p10 · g9) + (p11 · p10 · p9 · g8)
    G3 = g15 + (p15 · g14) + (p15 · p14 · g13) + (p15 · p14 · p13 · g12)

Then the equations at this higher level of abstraction for the carry in for each 4-bit group of the 16-bit adder (C1, C2, C3, C4) are very similar to the carry out equations for each bit of the 4-bit adder (c1, c2, c3, c4):

    C1 = G0 + (P0 · c0)
    C2 = G1 + (P1 · G0) + (P1 · P0 · c0)
    C3 = G2 + (P2 · G1) + (P2 · P1 · G0) + (P2 · P1 · P0 · c0)
    C4 = G3 + (P3 · G2) + (P3 · P2 · G1) + (P3 · P2 · P1 · G0)
         + (P3 · P2 · P1 · P0 · c0)
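The two-level scheme can be sketched the same way. In the illustration below, `block_pg` and `cla16` are hypothetical names, and each 4-bit slice is summed with Python's `+` as a stand-in for the 4-bit ALU; only the group carries C1 to C4 follow the lookahead equations.

```python
def block_pg(a4, b4):
    """Return the group signals (P, G) for one 4-bit block."""
    g = [((a4 >> i) & 1) & ((b4 >> i) & 1) for i in range(4)]
    p = [((a4 >> i) & 1) | ((b4 >> i) & 1) for i in range(4)]
    P = p[3] & p[2] & p[1] & p[0]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return P, G


def cla16(a, b, c0=0):
    """16-bit add using group carry lookahead. Returns (sum, C4)."""
    blocks = [((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF) for k in range(4)]
    (P0, G0), (P1, G1), (P2, G2), (P3, G3) = [block_pg(x, y) for x, y in blocks]
    C1 = G0 | (P0 & c0)
    C2 = G1 | (P1 & G0) | (P1 & P0 & c0)
    C3 = G2 | (P2 & G1) | (P2 & P1 & G0) | (P2 & P1 & P0 & c0)
    C4 = (G3 | (P3 & G2) | (P3 & P2 & G1) | (P3 & P2 & P1 & G0)
          | (P3 & P2 & P1 & P0 & c0))
    total = 0
    for k, ((x, y), cin) in enumerate(zip(blocks, [c0, C1, C2, C3])):
        total |= ((x + y + cin) & 0xF) << (4 * k)   # 4-bit ALU for this slice
    return total, C4
```

Note that the carries into the upper slices come from the lookahead equations, not from the slice below, matching the remark that the carries come from the carry-lookahead unit.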
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2nd level Carry, Propagate as Plumbing

    C1 = G0 + c0 · P0

    P0 = p3 · p2 · p1 · p0
    G0 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)
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Arithmetic-Logic Units 
* Combinational logic element that performs multiple functions:
  - Arithmetic: add, subtract
  - Logical: AND, OR, & NOR
* Gates, multiplexer for logic functions & adder

[Figure: ALU symbol with inputs A and B, an Operation select input, and output F(A, B).]

Multifunction ALUs

[Figure: general structure of a simple arithmetic/logic unit. Operand 1 and Operand 2 feed both a logic unit (logic fn: AND, OR, ...) and an arithmetic unit (arith fn: add, sub, ...); a "select fn type (logic or arith)" control chooses which one drives Result.]
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Functioning of 32-bit ALU 


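ALU Control

[Figure and table: the 32-bit ALU symbol with its ALU control lines and Overflow output. The table of control-line encodings includes rows for or, slt, and nor.]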
* Result lines provide result of the chosen function applied to values of 
A and B 
* Since this ALU operates on 32-bit operands, it is called 32-bit ALU 
* Zero output indicates if all Result lines have value 0 
* Overflow indicates integer overflow of add and subtract functions; 
for unsigned integers, this overflow indicator does not provide any useful 
information 
* Carry out indicates carry out and unsigned integer overflow 





Full-Adder (FA)

* Examine the full adder table (inputs x, y, Cin; outputs Cout, S).

    Cout = x·y + Cin·(x + y)
    S = x'·y'·c + x'·y·c' + x·y'·c' + x·y·c = x XOR y XOR c

* In general, for bit i:
    c_{i+1} = x_i·y_i + c_i·(x_i + y_i), where c_{i+1} = Cout and c_i = Cin
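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A 1-Bit ALU
* The 1-bit logical unit for AND and OR.

[Figure: 1-bit logical unit; the Operation input selects between the AND and the OR of the inputs.]

A 1-Bit ALU (Subtraction)

* A 1-bit ALU that performs AND, OR, Add & Sub

[Figure: 1-bit ALU with Binvert and Operation control inputs.]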
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Set Less Than (slt) Function 


The slt function is defined as:

    A slt B = 000 ... 001   if A < B, i.e., if A - B < 0
    A slt B = 000 ... 000   if A >= B, i.e., if A - B >= 0

* Thus, each 1-bit ALU should have an additional input (called "Less") that provides the result for the slt function. This input has value 0 for all but the 1-bit ALU for the least significant bit.
* For the least significant bit, the Less value should be the sign of A - B.

    SLT $R, $A, $B
    if ($A < $B)
        $R = 00...01
    else
        $R = 00...00
    The upper 31 bits are 0.
    SLT is implemented using SUBTRACTION:
    ($A - $B) is negative  <=>  $A < $B
    bit 0 of $R = 1  <=>  the difference is negative
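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1-bit ALU for the most significant bit. The direct output from the (last) adder, used for the less-than comparison, is called Set.

[Figure: the most significant 1-bit ALU with inputs Ainvert, Binvert, CarryIn, and Operation, and outputs Result, Set, and Overflow.]

[Figure: the 32-bit ALU built from these 1-bit ALUs, producing Result0, Result1, Result2, ..., Result31.]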
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Support conditional branch instructions 


(a - b = 0) => a = b

Zero = NOT (Result31 + Result30 + ... + Result2 + Result1 + Result0)

32-bit ALU with 6 Functions

A nor B = (not A) and (not B)

[Figure: 32-bit ALU with Binvert control and Overflow output.]
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Multiplication 
MULTIPLY (unsigned)

Long-multiplication approach (shift-add method):

* Paper and pencil example (unsigned):

    Multiplicand        1000
    Multiplier        x 1001
                        1000
                       0000
                      0000
                     1000
    Product         01001000

* m bits x n bits = m+n bit product
* Binary makes it easy:
  - 0: place 0 (0 x multiplicand)
  - 1: place a copy (1 x multiplicand)
* 3 versions of multiply hardware & algorithm
* Accomplished via shifting and addition
* Consumes more time and more chip area than addition
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Unsigned shift-add multiplier (version 1) 


* 64-bit Multiplicand reg, 64-bit ALU, 64-bit Product reg, 32-bit Multiplier reg

[Figure: version 1 datapath. The 64-bit Multiplicand register feeds the 64-bit ALU together with the 64-bit Product register (write enable; initially zeros). The 32-bit Multiplier register feeds the Control unit.]

Multiplier = datapath + control

Multiply Algorithm - Version 1

1. Test Multiplier0.
   1a. If Multiplier0 = 1, add multiplicand to product & place the result in Product register. If Multiplier0 = 0, skip the add.
2. Shift the Multiplicand register left 1 bit.
3. Shift the Multiplier register right 1 bit.
32nd repetition? No (< 32 repetitions): go back to step 1. Yes (32 repetitions): done.

    Product     Multiplier   Multiplicand
    0000 0000   0011         0000 0010
    0000 0010   0001         0000 0100
    0000 0110   0000         0000 1000
    0000 0110
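The version 1 algorithm translates almost line for line into Python. This is a sketch under stated assumptions: `multiply_v1` is a hypothetical name, and the word size `n` is a parameter so that the 4-bit trace can be reproduced alongside the 32-bit case.

```python
def multiply_v1(multiplicand, multiplier, n=32):
    """Version 1: 2n-bit Multiplicand and Product registers, n-bit Multiplier.

    Returns (product, trace), where trace lists the
    (product, multiplier, multiplicand) registers after each iteration.
    """
    mask2n = (1 << (2 * n)) - 1
    mcand = multiplicand & ((1 << n) - 1)    # starts in the right half
    mplier = multiplier & ((1 << n) - 1)
    product = 0
    trace = [(product, mplier, mcand)]
    for _ in range(n):
        if mplier & 1:                            # 1.  test Multiplier0
            product = (product + mcand) & mask2n  # 1a. add to product
        mcand = (mcand << 1) & mask2n             # 2.  shift multiplicand left
        mplier >>= 1                              # 3.  shift multiplier right
        trace.append((product, mplier, mcand))
    return product, trace
```

With `n=4`, multiplicand 2 and multiplier 3, the first three entries of `trace` match the rows of the register table above.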
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Example:
* M'ier: 0011, M'and: 0000 0010

    Iteration | Step           | Multiplier | Multiplicand | Product
    0         | Initial values | 0011       | 0000 0010    | 0000 0000

Observations on Multiply Version 1

* 1 cycle per step: 32 x 3 = ~100 cycles per multiply. However, one cycle per iteration can be saved by shifting the multiplier and multiplicand in the same cycle: 32 x 2.
* 50% of the bits in the multiplicand are 0, so the 64-bit adder is wasted.
* 0s are inserted at the right of the multiplicand as it is shifted left, so the least significant bits of the product never change once formed.
* Instead of shifting the multiplicand to the left, shift the product to the right.
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Example  6.bit x 6-bit 1st Version multiplier (58 x 23) 


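Multiplicand = 58 = unsigned 6-bit = (111010)_2
Multiplier = 23 = unsigned 6-bit = (010111)_2
Product = 58 x 23 = 1334 = (010100110110)_2

Initial Values:
Multiplicand Register = MC is 12 bits = 000000111010.
Multiplier Register = MR is 6 bits = 010111.
Product Register = PR is 12 bits = 000000000000.

6-bit x 6-bit 1st Version multiplier (58 x 23)

    Iteration | Step                            | MR     | MC           | PR
    0         | Initial values                  | 010111 | 000000111010 | 000000000000
    1         | 1: MR[0] = 1 -> PR = PR + MC    | 010111 | 000000111010 | 000000111010
              | 2: SH_R MR, SH_L MC 1 bit       | 001011 | 000001110100 | 000000111010
    2         | 1: MR[0] = 1 -> PR = PR + MC    | 001011 | 000001110100 | 000010101110
              | 2: SH_R MR, SH_L MC 1 bit       | 000101 | 000011101000 | 000010101110
    3         | 1: MR[0] = 1 -> PR = PR + MC    | 000101 | 000011101000 | 000110010110
              | 2: SH_R MR, SH_L MC 1 bit       | 000010 | 000111010000 | 000110010110
    4         | 1: MR[0] = 0 -> no operation    | 000010 | 000111010000 | 000110010110
              | 2: SH_R MR, SH_L MC 1 bit       | 000001 | 001110100000 | 000110010110
    5         | 1: MR[0] = 1 -> PR = PR + MC    | 000001 | 001110100000 | 010100110110
              | 2: SH_R MR, SH_L MC 1 bit       | 000000 | 011101000000 | 010100110110
    6         | 1: MR[0] = 0 -> no operation    | 000000 | 011101000000 | 010100110110
              | 2: SH_R MR, SH_L MC 1 bit       | 000000 | 111010000000 | 010100110110

Stop. Result: PR = 58 x 23 = 1334 = 0x536 = (010100110110)_2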
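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Multiply Hardware - Version 2

* 32-bit Multiplicand reg, 32-bit ALU, 64-bit Product reg, 32-bit Multiplier reg

[Figure: version 2 datapath and control.]

1. Test Multiplier0.
   1a. If Multiplier0 = 1, add multiplicand to the left half of product & place the result in the left half of Product register. If Multiplier0 = 0, skip the add.
2. Shift the Product register right 1 bit.
3. Shift the Multiplier register right 1 bit.
32nd repetition? No (< 32 repetitions): go back to step 1. Yes (32 repetitions): done.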
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Example (M'ier: 0011, Mcand: 0010):

    Step                       M'ier   Mcand   P
    0.  Initial                0011    0010    0000 0000
    1a. 1 => P = P + Mcand     0011    0010    0010 0000
    2.  Shr P                  0011    0010    0001 0000
    3.  Shr M'ier              0001    0010    0001 0000
    1a. 1 => P = P + Mcand     0001    0010    0011 0000
    2.  Shr P                  0001    0010    0001 1000
    3.  Shr M'ier              0000    0010    0001 1000
    1.  0 => nop               0000    0010    0001 1000
    2.  Shr P                  0000    0010    0000 1100
    3.  Shr M'ier              0000    0010    0000 1100
    1.  0 => nop               0000    0010    0000 1100
    2.  Shr P                  0000    0010    0000 0110
    3.  Shr M'ier              0000    0010    0000 0110
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Multiply Hardware - Version 3 


* Product register wastes space that exactly matches the size of the multiplier
  -> combine the Multiplier register and the Product register
* 32-bit Multiplicand reg, 32-bit ALU, 64-bit Product reg (0-bit Multiplier reg)

[Figure: version 3 datapath.]

1. Test Product0.
   1a. If Product0 = 1, add multiplicand to the left half of product & place the result in the left half of Product register. If Product0 = 0, skip the add.
2. Shift the Product register right 1 bit (shift right: Carry + HI + LO).
32nd repetition? No (< 32 repetitions): go back to step 1. Yes (32 repetitions): done.
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Example (Multiplier 0011 held in the right half of P, Mcand: 0010):

    Step                       Mcand   P
    0.  Initial                0010    0000 0011
    1a. 1 => P = P + Mcand     0010    0010 0011
    2.  Shr P                  0010    0001 0001
    1a. 1 => P = P + Mcand     0010    0011 0001
    2.  Shr P                  0010    0001 1000
    1.  0 => nop               0010    0001 1000
    2.  Shr P                  0010    0000 1100
    1.  0 => nop               0010    0000 1100
    2.  Shr P                  0010    0000 0110

Example
* Consider: (1100)_2 x (1101)_2, Product = (10011100)_2
* 4-bit multiplicand and multiplier are used in this example
* 4-bit adder produces a 4-bit Sum + Carry bit

    Iteration | Step                                  | Multiplicand | Carry | Product = HI, LO
    0         | Initialize (HI = 0, LO = Multiplier)  | 1100         |       | 0000 1101
    1         | LO[0] = 1 => ADD                      |              | 0     | 1100 1101
              | Shift Right (Carry, Sum, LO) by 1 bit | 1100         |       | 0110 0110
    2         | LO[0] = 0 => NO addition              |              |       |
              | Shift Right (HI, LO) by 1 bit         | 1100         |       | 0011 0011
    3         | LO[0] = 1 => ADD                      |              | 0     | 1111 0011
              | Shift Right (Carry, Sum, LO) by 1 bit | 1100         |       | 0111 1001
    4         | LO[0] = 1 => ADD                      |              | 1     | 0011 1001
              | Shift Right (Carry, Sum, LO) by 1 bit | 1100         |       | 1001 1100

Observations on Multiply Version 3
* 2 steps per bit because Multiplier & Product are combined
* MIPS registers Hi and Lo are the left and right halves of Product
* Gives us the MIPS instruction MultU
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E What about signed multiplication? 
Signed Multiplication 


To use version 1 & 2 as a Signed Multiplication 
Convert multiplier and multiplicand into positive numbers 
If negative then obtain the 2's complement and remember the sign 
Perform unsigned multiplication 
Compute the sign of the product 


3rd Version: We can use the 3rd version of the unsigned multiplication
hardware to perform signed multiplication.
  * When shifting right, extend the sign of the product
  * If the multiplier is negative, the last step should be a subtract

* Case 1: Positive Multiplier

  Multiplicand        1100₂ = -4
  Multiplier        x 0101₂ = +5
  Sign-extension    11111100
                  + 111100
  Product           11101100₂ = -20

* Case 2: Negative Multiplier

  Multiplicand        1100₂ = -4
  Multiplier        x 1101₂ = -3
  Sign-extension    11111100
                  + 111100
                  + 00100       (2's complement of 1100)
  Product           00001100₂ = +12
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Sequential Signed Multiplier 
< ALU produces: 32-bit sum + sign bit 












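[Figure: sequential signed multiplier datapath: Multiplicand (32 bits), 32-bit ALU (add, sub) producing sum and sign, Product register HI, LO (64 bits) with shift right, LO[0] fed to Control]

* Sign bit can be computed:
  - No overflow: sign = sum[31]
  - If overflow: sign = ~sum[31]
* Overflow cases:
  - Positive + Positive: on overflow the sum looks negative (sign : 1), so the inverse is used, sign : 0
  - Negative + Negative: on overflow the sum looks positive (sign : 0), so the inverse is used, sign : 1

Algorithm:
  Start: HI = 0, LO = Multiplier
  LO[0] = 1?
      First 31 iterations: HI = HI + Multiplicand
      Last iteration:      HI = HI - Multiplicand
  Shift Right (Sign, HI, LO) 1 bit
  32nd Repetition?
      No: repeat from the LO[0] test
      Yes: Done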
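The signed flowchart can be modeled as below. This is a rough sketch under stated assumptions: the names `to_signed` and `signed_multiply` are made up for illustration, and HI is held as an ordinary signed Python integer, so the separate "sign" bit of the hardware is implicit in Python's arithmetic right shift.

```python
def to_signed(value, n):
    """Interpret the low n bits of value as a two's complement number."""
    value &= (1 << n) - 1
    return value - (1 << n) if value >> (n - 1) else value


def signed_multiply(multiplicand, multiplier, n=32):
    """Sequential signed multiply: add on the first n-1 steps, subtract on the last."""
    mc = to_signed(multiplicand, n)
    hi = 0                                   # HI = 0 (kept as a signed integer)
    lo = multiplier & ((1 << n) - 1)         # LO = multiplier
    for i in range(n):
        if lo & 1:
            # the last iteration tests the multiplier's sign bit, so it subtracts
            hi = hi - mc if i == n - 1 else hi + mc
        # shift (Sign, HI, LO) right 1 bit: HI[0] moves into the top of LO
        lo = (lo >> 1) | ((hi & 1) << (n - 1))
        hi >>= 1                             # arithmetic shift keeps the sign
    return (hi << n) | lo                    # signed 2n-bit product


if __name__ == "__main__":
    print(signed_multiply(-4, -3, n=4))
    print(signed_multiply(-30, 15, n=6))
```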
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Example: 374 Version Signed Multiplication 
* Multiplicand = -30 = signed 6-bit = (100010)₂
* Multiplier = 15 = signed 6-bit = (001111)₂
* Product = -30 x 15 = -450 = (111000111110)₂
* Initial Values:
  - Multiplicand Register = MC is 6 bits = 100010
  - Product Register {HI, LO} is 12 bits

3rd Version signed Multiplier: -30 x 15, MC = 100010

Step                                 HI       LO
Initial Values                       000000   001111
1: PR[0] = 1 --> HI = HI + MC        100010   001111
2: Shift right (HI, LO) 1 bit        110001   000111
1: PR[0] = 1 --> HI = HI + MC        010011   000111
2: Shift right (HI, LO) 1 bit        101001   100011
1: PR[0] = 1 --> HI = HI + MC        001011   100011
2: Shift right (HI, LO) 1 bit        100101   110001
1: PR[0] = 1 --> HI = HI + MC        000111   110001
2: Shift right (HI, LO) 1 bit        100011   111000
1: PR[0] = 0 --> no operation        100011   111000
2: Shift right (HI, LO) 1 bit        110001   111100
1: PR[0] = 0 --> no operation        110001   111100
2: Shift right (HI, LO) 1 bit        111000   111110

Result: PR = -30 x 15 = -450 = (111000111110)₂


Example
* Consider: 1100₂ (-4) x 1101₂ (-3), Product = 00001100₂
* Check for overflow: No overflow => Extend sign bit
* Last iteration: add 2's complement of Multiplicand

Iteration                                   Multiplicand   Sign   Product = HI, LO
0  Initialize (HI = 0, LO = Multiplier)     1100           -      0000 1101
1  LO[0] = 1 => ADD                         1100           1      1100 1101
   Shift (Sign, HI, LO) right 1 bit         1100           -      1110 0110
2  LO[0] = 0 => Do Nothing                  1100           -      1110 0110
   Shift (Sign, HI, LO) right 1 bit         1100           -      1111 0011
3  LO[0] = 1 => ADD                         1100           1      1011 0011
   Shift (Sign, HI, LO) right 1 bit         1100           -      1101 1001
4  LO[0] = 1 => SUB (ADD 2's compl)         0100           0      0001 1001
   Shift (Sign, HI, LO) right 1 bit         1100           -      0000 1100
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Booth's Algorithm for signed multiplication 


(This algorithm was invented by Andrew Donald Booth in 1950.) Booth's algorithm is a
powerful algorithm that is used for signed multiplication. It generates a 2n-bit product for two
n-bit signed numbers.

* In general, in the Booth scheme, -1 times the shifted multiplicand is selected when
  moving from 0 to 1, and +1 times the shifted multiplicand is selected when moving
  from 1 to 0, as the multiplier is scanned from right to left.

   0  0  1  0  1  1  0  0  1  1  1  0  1  0  1  1  0  0     (multiplier)
   0 +1 -1 +1  0 -1  0 +1  0  0 -1 +1 -1 +1  0 -1  0  0     (recoded)

  Booth recoding of a multiplier

Booth Multiplier Recoding Table

  Multiplier           Version of multiplicand
  Bit i    Bit i-1     selected by bit i
  0        0            0 x M
  0        1           +1 x M
  1        0           -1 x M
  1        1            0 x M

Booth Algorithm Example for Negative Multiplier

    0 1 1 0 1   (+13)             0  1  1  0  1
  x 1 1 0 1 0   (-6)      =>   x  0 -1 +1 -1  0
                               -----------------------
                               0 0 0 0 0 0 0 0 0 0
                               1 1 1 1 1 0 0 1 1
                               0 0 0 0 1 1 0 1
                               1 1 1 0 0 1 1
                               0 0 0 0 0 0
                               -----------------------
                               1 1 1 0 1 1 0 0 1 0   (-78)

  Recoded multiplier value: 0 x 2^4 - 1 x 2^3 + 1 x 2^2 - 1 x 2^1 + 0 x 2^0 = 0 - 8 + 4 - 2 = -6
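The recoding table can be checked with a short sketch. The function name `booth_recode` is an assumption made for this example; it scans the multiplier from right to left with an implied 0 to the right of the least significant bit, as the table describes.

```python
def booth_recode(multiplier, n):
    """Return the n Booth digits (each -1, 0 or +1) of an n-bit multiplier, MSB first."""
    digits = []
    prev = 0                          # implied bit to the right of bit 0
    for i in range(n):
        bit = (multiplier >> i) & 1
        digits.append(prev - bit)     # (bit i, bit i-1): 00 -> 0, 01 -> +1, 10 -> -1, 11 -> 0
        prev = bit
    return digits[::-1]


if __name__ == "__main__":
    print(booth_recode(0b11010, 5))
```

Summing digit x 2^position over the recoded digits gives back the signed value of the multiplier, which is why the recoded form can drive the add/subtract steps.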
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32 bits 32 bits 


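[Figure: Booth multiplier datapath: Multiplicand (32 bits), ALU (add, sub) with 33-bit result, Product register HI, LO (64 bits) with shift right]

Algorithm (test the pair {LO[0], previous LO[0]}):
  = 01:        HI = HI + Multiplicand
  = 10:        HI = HI - Multiplicand
  = 00 or 11:  no operation
  Then: Shift Right Product = (HI, LO) 1 bit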
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Example: (Multiplicand is negative) 


Booth's Algorithm Multiplier: -30 x 15 = -450

MC = -30 = (100010)₂

Step                                       HI       LO
Initial Values (LO of product = MR)        000000   001111
1: {PR[0], 0} = 10
   Subtract (HI = HI - MC)                 011110   001111
2: Shift right PR                          001111   000111
1: {PR[0], 1} = 11
   Shift right                             000111   100011
1: {PR[0], 1} = 11
   Shift right                             000011   110001
1: {PR[0], 1} = 11
   Shift right                             000001   111000
1: {PR[0], 1} = 01
   ADD (HI = HI + MC)                      100011   111000
2: Shift right PR                          110001   111100
1: {PR[0], 0} = 00
   Shift right                             111000   111110
Result: PR = -30 x 15 = -450 = (111000111110)₂
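A compact model of the Booth loop, offered as a sketch rather than as the slide's hardware. The names `to_signed` and `booth_multiply` are hypothetical, and HI is kept as an unbounded signed Python integer, which plays the role of the wider ALU result.

```python
def to_signed(value, n):
    """Interpret the low n bits of value as a two's complement number."""
    value &= (1 << n) - 1
    return value - (1 << n) if value >> (n - 1) else value


def booth_multiply(multiplicand, multiplier, n=32):
    """Booth's algorithm: inspect (LO[0], previous bit), then shift right arithmetically."""
    mc = to_signed(multiplicand, n)
    hi, lo, prev = 0, multiplier & ((1 << n) - 1), 0
    for _ in range(n):
        pair = (lo & 1, prev)
        if pair == (1, 0):
            hi -= mc                          # 10: HI = HI - Multiplicand
        elif pair == (0, 1):
            hi += mc                          # 01: HI = HI + Multiplicand
        prev = lo & 1                         # 00 or 11: no operation
        lo = (lo >> 1) | ((hi & 1) << (n - 1))
        hi >>= 1                              # arithmetic shift right of (HI, LO)
    return (hi << n) | lo


if __name__ == "__main__":
    print(booth_multiply(-30, 15, n=6))
    print(booth_multiply(14, -5, n=5))
```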
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Example: (Multiplier is negative) 
Multiply 14 times -5 using 5-bit numbers (10-bit result).
14 in binary: 01110 (Multiplicand)
-5 in binary: 11011 (Multiplier)

Expected result: 14 x -5 = -70 in binary: 11101 11010

Product register: upper 5 bits 0, lower 5 bits multiplier, 1 "Booth bit" initially 0.
-Multiplicand = 10010 (2's complement of 01110).

Multiplicand   Action                       Upper    Lower    Booth bit
01110          Initialization               00000    11011    0
               10: Subtract Multiplicand    10010    11011    0      00000 + 10010 = 10010
               Shift Right Arithmetic       11001    01101    1
               11: No-op                    11001    01101    1
               Shift Right Arithmetic       11100    10110    1
               01: Add Multiplicand         01010    10110    1      11100 + 01110 = 01010
                                                                     (Carry ignored because adding a positive and
                                                                     negative number cannot overflow.)
               Shift Right Arithmetic       00101    01011    0
               10: Subtract Multiplicand    10111    01011    0      00101 + 10010 = 10111
               Shift Right Arithmetic       11011    10101    1
               11: No-op                    11011    10101    1
               Shift Right Arithmetic       11101    11010    1
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Integer Multiplication 4 Division 


* Consider a x b and a / b where a and b are in $s1 and $s2
  - Signed multiplication:   mult  $s1,$s2
  - Unsigned multiplication: multu $s1,$s2
  - Signed division:         div   $s1,$s2
  - Unsigned division:       divu  $s1,$s2
* For multiplication, the result is 64 bits
  - LO = low-order 32 bits and HI = high-order 32 bits
* For division
  - LO = 32-bit quotient and HI = 32-bit remainder
  - If the divisor is 0 then the result is unpredictable
* Moving data
  - mflo rd (move from LO to rd), mfhi rd (move from HI to rd)
  - mtlo rs (move to LO from rs), mthi rs (move to HI from rs)

[Figure: Multiply / Divide unit writing the HI and LO registers]

* Signed arithmetic: mult, div (rs and rt are signed)
  - LO = 32-bit low-order and HI = 32-bit high-order of multiplication
  - LO = 32-bit quotient and HI = 32-bit remainder of division
* Unsigned arithmetic: multu, divu (rs and rt are unsigned)
* NO arithmetic exception can occur
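To make the HI/LO conventions concrete, here is a small behavioural model. It is a sketch only: the Python function names mirror the instruction names but are otherwise invented, each returns the pair (HI, LO) as 32-bit unsigned values, and a zero divisor simply raises Python's `ZeroDivisionError` because the slide says the hardware result is unpredictable.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit register value as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x >> 31 else x


def mult(rs, rt):
    p = to_signed32(rs) * to_signed32(rt)
    return (p >> 32) & MASK32, p & MASK32          # (HI, LO)


def multu(rs, rt):
    p = (rs & MASK32) * (rt & MASK32)
    return p >> 32, p & MASK32                     # (HI, LO)


def div(rs, rt):
    a, b = to_signed32(rs), to_signed32(rt)
    q = abs(a) // abs(b)                           # truncate toward zero
    if (a < 0) != (b < 0):
        q = -q
    r = a - q * b                                  # remainder takes the dividend's sign
    return r & MASK32, q & MASK32                  # (HI = remainder, LO = quotient)


def divu(rs, rt):
    rs, rt = rs & MASK32, rt & MASK32
    return rs % rt, rs // rt                       # (HI = remainder, LO = quotient)


if __name__ == "__main__":
    hi, lo = mult(-4 & MASK32, 5)
    print(hex(hi), hex(lo))
```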
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Combinational Multiplier (unsigned) 


                        X3    X2    X1    X0     <- multiplicand
                     *  Y3    Y2    Y1    Y0     <- multiplier
                     ------------------------
                        X3Y0  X2Y0  X1Y0  X0Y0
                +  X3Y1  X2Y1  X1Y1  X0Y1
          +  X3Y2  X2Y2  X1Y2  X0Y2
    +  X3Y3  X2Y3  X1Y3  X0Y3

Partial products, one for each bit in the multiplier (each bit needs just one AND gate)





   X3Y0  X3Y0
+  X3Y1  X3Y1
+  X3Y2  X3Y2
+  X3Y3  X3Y3
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Faster Multiplier 


* Moore's Law has provided so much more in resources that hardware
  designers can now build much faster multiplication hardware.
* Faster multiplications are possible by essentially providing one 32-bit
  adder for each bit of the multiplier: one input is the multiplicand ANDed
  with a multiplier bit, and the other is the output of a prior adder.
* Fast multiplication hardware: rather than use a single 32-bit adder 31
  times, the following hardware unrolls the loop to use 31 adders.
* 32-bit adder for each bit of the multiplier
  - 31 adders are needed for a 32-bit multiplier
  - AND multiplicand with each bit of multiplier
  - Product = accumulated shifted sum
* Each adder produces a 33-bit output
  - Most significant bit is a carry bit
  - Least significant bit is a product bit
  - Upper 32 bits go to the next adder
* Array multiplier can be optimized
  - Carry save adders reduce delays

[Figure: array multiplier built from a column of 32-bit adders producing the product bits]
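The unrolled structure can be mimicked in software as a sanity check. This is an illustrative sketch, assuming a hypothetical function `array_multiply`; each loop pass stands for one of the n-1 adders, whose (n+1)-bit output is split into one product bit and n bits passed on to the next adder.

```python
def array_multiply(multiplicand, multiplier, n=32):
    """Model of an unrolled (array) multiplier built from n-1 adders."""
    mask = (1 << n) - 1
    mc = multiplicand & mask
    # Row 0 needs no adder: it is just the multiplicand ANDed with multiplier bit 0.
    row = mc if multiplier & 1 else 0
    product = row & 1                       # product bit 0
    upper = row >> 1                        # remaining bits feed the first adder
    for i in range(1, n):                   # one adder per remaining multiplier bit
        addend = mc if (multiplier >> i) & 1 else 0
        total = upper + addend              # (n+1)-bit adder output, MSB is the carry
        product |= (total & 1) << i         # least significant bit is a product bit
        upper = total >> 1                  # upper n bits go to the next adder
    return product | (upper << n)           # last adder supplies the high product bits


if __name__ == "__main__":
    print(array_multiply(12, 13, n=4))
```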
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Carry Save Adders 
* An n-bit carry-save adder produces two n-bit outputs
  - n-bit partial sum bits and n-bit carry bits
* All the n bits of a carry-save adder work in parallel
  - The carry does not propagate as in a carry-propagate adder
  - This is why a carry-save adder is faster than a carry-propagate adder
* Useful when adding multiple numbers (as in multipliers)

[Figure: Carry-Propagate Adder (carries chained between bit positions) vs. Carry-Save Adder (independent full adders with separate sum and carry outputs)]
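A carry-save adder is easy to express bitwise. The sketch below uses an assumed name, `carry_save_add`, and works on whole Python integers so that every bit position is handled at once, which mirrors the "all bits work in parallel" point above.

```python
def carry_save_add(a, b, c):
    """Add three numbers bitwise; return (partial_sum, carry) with no carry propagation."""
    partial_sum = a ^ b ^ c                          # sum bit of each full adder
    carry = ((a & b) | (a & c) | (b & c)) << 1       # carry bit, weighted one position up
    return partial_sum, carry


if __name__ == "__main__":
    s, c = carry_save_add(5, 6, 7)
    print(s, c, s + c)
```

The two outputs always satisfy partial_sum + carry = a + b + c; a single carry-propagate addition at the very end resolves them into one number.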
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Carry-Save Adders in a Multiplier 
* Suppose we want to multiply two numbers A and B
  - Example on 4-bit numbers: A = a3 a2 a1 a0 and B = b3 b2 b1 b0
* Step 1: AND (multiply) each bit of A with each bit of B
  - Requires n^2 AND gates and produces n^2 product bits
* Step 2: ADD the product bits vertically using carry-save adders
  - Full Adder adds three vertical bits
  - Half Adder adds two vertical bits
  - Each adder produces a partial sum and a carry
* Use a carry-propagate adder for the final addition

[Figure: partial-product array for A x B]
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Step 3: Use carry save adders to add the partial products 
  - Reduce the partial products to just two numbers

Step 4: Use a carry-propagate adder to add the last two numbers

[Figure: 4 x 4 multiplier: AND-gate product bits a_i b_j feed rows of carry-save adders, followed by a carry-propagate adder producing product bits P7 ... P0]

Summary of a Fast Multiplier

* A fast n-bit x n-bit multiplier requires:
  - n^2 AND gates to produce n^2 product bits in parallel
  - Many adders to perform additions in parallel
* Uses carry-save adders to reduce delays
* Higher cost (more chip area) than sequential multiplier
* Higher performance (faster) than sequential multiplier
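The four steps can be strung together in a short word-level sketch. Names such as `csa` and `fast_multiply` are assumptions for illustration, and the reduction order used here is just one simple choice; real multipliers arrange the carry-save adders as an array or tree of individual full and half adders.

```python
def csa(a, b, c):
    """Carry-save adder: three inputs in, (partial sum, carry) out."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def fast_multiply(a, b, n=4):
    """Multiply two unsigned n-bit numbers with carry-save reduction."""
    # Steps 1-2: AND each bit of B with A to form the shifted partial products.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    # Step 3: carry-save adders reduce the partial products to just two numbers.
    while len(rows) > 2:
        s, c = csa(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, c]
    # Step 4: one carry-propagate addition of the last two numbers.
    return sum(rows)


if __name__ == "__main__":
    print(fast_multiply(12, 13, n=4))
```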
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Unsigned Division 


                       10011₂   = 19   Quotient
                    ----------
  Divisor  1011₂  ) 11011001₂   = 217  Dividend
                   -1011
                    ----
                      10
                      101
                      1010
                      10100
                     -1011
                      ----
                       1001
                       10011
                      -1011
                       ----
                       1000₂    = 8    Remainder

* Try to see how big a number can be subtracted, creating a digit of the
  quotient on each attempt
* Binary division is accomplished via shifting and subtraction
* Dividend = Quotient x Divisor + Remainder
  217 = 19 x 11 + 8
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3 versions of divide, successive refinement 


DIVIDE HARDWARE Version 1

* 64-bit Divisor reg, 64-bit ALU, 64-bit Remainder reg,
  32-bit Quotient reg

[Figure: Divide hardware version 1: Divisor register (64 bits, shift right), 64-bit ALU, Quotient register (32 bits, shift left), Remainder register (64 bits), Control]

* Takes n+1 steps for n-bit Quotient & Rem.

Algorithm:
  1.  Subtract the Divisor register from the Remainder register, and place
      the result in the Remainder register.
  Test Remainder:
  2a. Remainder >= 0: Shift the Quotient register to the left, setting the
      new rightmost bit to 1.
  2b. Remainder < 0: Restore the original value by adding the Divisor register
      to the Remainder register, & place the sum in the Remainder register.
      Also shift the Quotient register to the left, setting the new rightmost
      bit to 0.
  3.  Shift the Divisor register right 1 bit.
  n+1 repetitions?
      No: < n+1 repetitions, go back to step 1.
      Yes: n+1 repetitions, done.

EX: 7 / 2   Quotient = 3, Remainder = 1
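The version 1 flowchart translates almost line for line into code. This is a hedged sketch: `divide_v1` and the width parameter `n` are names chosen for the example, and the zero-divisor check is an addition of this model, not part of the slide's flowchart.

```python
def divide_v1(dividend, divisor, n=32):
    """Restoring division, version 1: 2n-bit Divisor register shifted right each step."""
    if divisor == 0:
        raise ZeroDivisionError("divisor is 0")
    remainder = dividend                     # Remainder register starts as the dividend
    div = divisor << n                       # divisor sits in the left half of its register
    quotient = 0
    for _ in range(n + 1):                   # n+1 repetitions
        remainder -= div                     # 1.  Remainder = Remainder - Divisor
        if remainder >= 0:
            quotient = (quotient << 1) | 1   # 2a. shift in a 1
        else:
            remainder += div                 # 2b. restore, then shift in a 0
            quotient <<= 1
        div >>= 1                            # 3.  shift Divisor right 1 bit
    return quotient & ((1 << n) - 1), remainder


if __name__ == "__main__":
    print(divide_v1(7, 2, n=4))
```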
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Divide Algorithm Version 1: 

7 (0111) / 2 (0010) = 3 (0011) R 1 (0001)

Step      Remainder    Quotient   Divisor      Rem-Div
Initial   0000 0111    0000       0010 0000    < 0
1         0000 0111    0000       0001 0000    < 0
2         0000 0111    0000       0000 1000    < 0
3         0000 0111    0000       0000 0100    0000 0011 > 0
4         0000 0011    0001       0000 0010    0000 0001 > 0
5         0000 0001    0011       0000 0001
Final     1            3

* Observations on Divide version 1:
  - Half the bits in the divisor are always 0
    => 1/2 of the 64-bit adder is wasted
    => 1/2 of the divisor register is wasted
  - Intuition: instead of shifting the divisor to the right, shift the remainder to
    the left...
  - Step 1 cannot produce a 1 in the quotient bit, as all bits
    corresponding to the divisor in the remainder register are 0
    (remember all operands are 32-bit)
  - Intuition: switch the order to shift first and then subtract; this can save 1
    iteration...
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DIVIDE HARDWARE Version 2 


* 32-bit Divisor reg, 32-bit ALU, 64-bit Remainder reg,
  32-bit Quotient reg

[Figure: Divide hardware version 2: Divisor register (32 bits), 32-bit ALU, Quotient register (32 bits, shift left), Remainder register (64 bits, shift left), Control]

Algorithm:
  1.  Shift the Remainder register left 1 bit.
  2.  Subtract the Divisor register from the left half of the Remainder
      register, & place the result in the left half of the Remainder register.
  Test Remainder:
  3a. Remainder >= 0: Shift the Quotient register to the left, setting the
      new rightmost bit to 1.
  3b. Remainder < 0: Restore the original value by adding the Divisor register
      to the left half of the Remainder register, & place the sum in the left
      half of the Remainder register. Also shift the Quotient register to the
      left, setting the new least significant bit to 0.
  n repetitions?
      No: < n repetitions, repeat.
      Yes: n repetitions (n = 4 here), done.
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DIVIDE HARDWARE Version 3 


* 32-bit Divisor reg, 32-bit ALU, 64-bit Remainder reg
  (0-bit Quotient reg)

[Figure: Divide hardware version 3: Divisor register (32 bits), 32-bit ALU, Remainder register (64 bits, shift left) whose right half collects the Quotient, Control]

Algorithm:
  Start: Place Dividend in Remainder.
  1.  Shift the Remainder register left 1 bit.
  2.  Subtract the Divisor register from the left half of the Remainder
      register, & place the result in the left half of the Remainder register.
  Test Remainder:
  3a. Remainder >= 0: Shift the Remainder register to the left, setting the
      new rightmost bit to 1.
  3b. Remainder < 0: Restore the original value by adding the Divisor register
      to the left half of the Remainder register, & place the sum in the left
      half of the Remainder register. Also shift the Remainder register to the
      left, setting the new least significant bit to 0.
  n repetitions?
      No: < n repetitions, go back to step 2.
      Yes: n repetitions (n = 4 here).
  Done. Shift left half of Remainder right 1 bit.
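Version 3 can be modeled with a single integer standing in for the combined Remainder/Quotient register. The sketch below is illustrative only: `divide_v3` is an invented name, the dividend is assumed to fit in n bits, and the zero-divisor check is an addition of this model.

```python
def divide_v3(dividend, divisor, n=32):
    """Restoring division, version 3: quotient bits shift into the Remainder register."""
    if divisor == 0:
        raise ZeroDivisionError("divisor is 0")
    rem = dividend << 1                      # Start + step 1: shift Remainder left 1 bit
    for _ in range(n):
        rem -= divisor << n                  # 2.  subtract Divisor from the left half
        if rem >= 0:
            rem = (rem << 1) | 1             # 3a. shift left, new rightmost bit = 1
        else:
            rem += divisor << n              # 3b. restore the left half ...
            rem <<= 1                        #     ... shift left, new rightmost bit = 0
    remainder = rem >> (n + 1)               # Done: shift left half right 1 bit
    quotient = rem & ((1 << n) - 1)          # right half holds the quotient
    return quotient, remainder


if __name__ == "__main__":
    print(divide_v3(7, 2, n=4))
```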
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7 (0111) / 2 (0010) = 3 (0011) R (0001) 


Step      Remainder    Divisor   Rem-Div
Initial   0000 0111    0010      Always < 0
Shift     0000 1110    0010      < 0
1         0001 1100    0010      < 0
2         0011 1000    0010      > 0
3         0011 0001    0010      0011 - 0010 > 0
4         0010 0011    0010
Final     Remainder = 1 (left half 0010 shifted right 1 bit), Quotient = 3 (right half 0011)

MIPS Division

* Use HI/LO registers for the result
  - 32-bit remainder in the HI register
  - 32-bit quotient in the LO register
* Instructions
  - div rs, rt / divu rs, rt
  - overflow is ignored
* Use mfhi, mflo to access the result
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Divisions involving Negatives 
* Simplest solution: convert to positive and adjust the sign later
* Note that multiple solutions exist for the equation:
      Dividend = Quotient x Divisor + Remainder

      +7 div +2    Quo = +3    Rem = +1
      -7 div +2    Quo = -3    Rem = -1
      +7 div -2    Quo = -3    Rem = +1
      -7 div -2    Quo = +3    Rem = -1

* Convention: Dividend and remainder have the same sign
* Quotient is negative if the signs disagree
* These rules fulfil the equation above

Signed Division
* Simplest way is to remember the signs
* Convert the dividend and divisor to positive
* Do the unsigned division
* Compute the signs of the quotient and remainder
  - Quotient sign = Dividend sign XOR Divisor sign
  - Remainder sign = Dividend sign
* To summarize, if the dividend is negative, then two's complement must
  be applied to the remainder at the end. If the dividend and the divisor
  have different signs, then the quotient must be negated with a 2's
  complement operation at the end.
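The sign rules above fit in a few lines. This sketch uses a hypothetical function name, `signed_divide`, and Python integers in place of fixed-width registers; note that Python's own `//` and `%` follow a different convention (rounding toward negative infinity), which is why the magnitudes are divided first.

```python
def signed_divide(dividend, divisor):
    """Divide magnitudes, then fix the signs of quotient and remainder."""
    quotient = abs(dividend) // abs(divisor)     # unsigned division
    remainder = abs(dividend) % abs(divisor)
    if (dividend < 0) != (divisor < 0):          # quotient sign = dividend sign XOR divisor sign
        quotient = -quotient
    if dividend < 0:                             # remainder sign = dividend sign
        remainder = -remainder
    return quotient, remainder


if __name__ == "__main__":
    for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
        print(a, b, signed_divide(a, b))
```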
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Number Systems 


* For what kind of numbers do you know binary representations?
  - Positive integers
      Unsigned binary
  - Negative integers
      Sign/magnitude numbers
      Two's complement
  - Integers: 10011101. (binary point to the right of the LSB)
      For 32 bits, the unsigned range is 0 to about 4 billion
  - Fractions: .10011101 (binary point to the left of the MSB)
      Range [0 to 1)

Fractions: Two Representations

* Fixed-point: binary point is fixed
      1101101.0001001
* Floating-point: binary point floats to the right of the most
  significant 1 and an exponent is used
      1.1011010001001 x 2^6
* Floating-point numbers have two advantages over integers. First, they can
  represent values between integers. Second, because of the scaling factor, they can
  represent a much greater range of values.

Floating Point Numbers

* The largest 32-bit unsigned integer number is
      1111 1111 1111 1111 1111 1111 1111 1111₂ = 4,294,967,295
* What if we want to encode the approx. age of the earth?
      4,600,000,000 or 4.6 x 10^9
* or the weight in kg of one a.m.u. (atomic mass unit)
      0.0000000000000000000000000166 or 1.6 x 10^-27
* There is no way we can encode either of the above in a 32-bit integer.
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m The term floating point (real number) іс derived from the fact that there is по 
fixed number of digits before and after the radix point (Ex. decimal point); that is, the 
decimal point can float. 


In decimal, the number 123.456 is represented as 1.23456 x 10^2
In hexadecimal, the number 123.abc is represented as 1.23abc x 16^2
In binary, the number 10100.110 is represented as 1.0100110 x 2^4

* Representation for non-integral numbers
* Including very small and very large numbers (positive or negative)

[Figure: number line from -infinity to +infinity, marking -1 (-1 x 10^0), values very close to 0 (e.g. -1.23 x 10^-88), and +1 (+1 x 10^0)]

Floating-point Numbers (Decimal)

* We use a scientific notation to represent
  - Very small numbers (e.g. 1.0 x 10^-…)
  - Very large numbers (e.g. 8.64 x 10^…)
* Scientific notation: ± d.fraction x 10^(± exponent)
* Decimal scientific notation:
  - For example, 273₁₀ in scientific notation is
        273 = 2.73 x 10^2
* In general, a number is written in scientific notation as:
        ± M x B^E
  Where:
  - M = mantissa
  - B = base
  - E = exponent
* In the example, M = 2.73, B = 10, and E = 2
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% Floating-point numbers should be normalized 
  - Exactly one non-zero digit should appear before the point.
      In a decimal number, this digit can be from 1 to 9
      In a binary number, this digit should be 1
  - Normalized FP numbers: e.g. 5.941 x 10^…

Examples of Normalized Floating Point Numbers

These are normalized:
  * +1.23456789 x 10^1
  * -9.987654321 x 10^1
  * +5.0 x 10^0

These are not normalized:
  * +11.3 x 10^2        significand > radix
  * -0.0002 x 10^7      significand < 1.0
  * … x 10^…            exponent not integer

In binary
  * ±1.xxxxxxx₂ x 2^yyyy, where x and y are binary digits
        32₁₀     = 100000₂    = 1.0 x 2^5       = 0.1 x 2^6
        0.0625₁₀ = 0.0001₂    = 1.0 x 2^-4      = 0.1 x 2^-3
        26.625₁₀ = 11010.101₂ = 1.1010101 x 2^4 = 0.11010101 x 2^5

* Binary representation
  - (-1)^sign x significand x 2^exponent (e.g. -101.001101 x 2^111001)
  - more bits for the significand gives more accuracy
  - more bits for the exponent increases range
  - if 1 <= significand < 10₂ (= 2₁₀) then the number is normalized
  - E.g., -101.001101 x 2^111001 = -1.01001101 x 2^111011 (normalized)
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Floating-Point Representation 
  1 bit    8 bits      23 bits
  S        Exponent    Fraction

* S is the Sign bit (0 is positive and 1 is negative)
* E is the Exponent field (signed)
  - Very large numbers have large positive exponents
  - Very small close-to-zero numbers have negative exponents
  - More bits in the exponent field increases the range of values
* F is the Fraction field (fraction after the binary point)
  - More bits in the fraction field improves the precision of FP numbers

Floating-Point Representation 1

* Convert the decimal number to binary:
      228₁₀ = 11100100₂ = 1.11001 x 2^7
* Fill in each field of the 32-bit number:
  - The sign bit is positive (0)
  - The 8 exponent bits represent the value 7
  - The remaining 23 bits are the mantissa

  1 bit   8 bits      23 bits
  0       00000111    111 0010 0000 0000 0000 0000
  Sign    Exponent    Mantissa

Floating-Point Representation 2

* First bit of the mantissa is always 1:
      228₁₀ = 11100100₂ = 1.11001 x 2^7
* Thus, storing the most significant 1, also called the implicit leading 1, is
  redundant information
* Instead, store just the fraction bits in the 23-bit field
* The leading 1 is implied

  1 bit   8 bits      23 bits
  0       00000111    110 0100 0000 0000 0000 0000
  Sign    Exponent    Fraction


IEEE 754 Floating-Point Standard 


* Found in virtually every computer invented since 1980
  - Simplified porting of floating-point numbers
  - Unified the development of floating-point algorithms
  - Increased the accuracy of floating-point numbers
* Single Precision Floating Point Numbers (32 bits)
  - 1-bit sign + 8-bit exponent + 23-bit fraction
* Double Precision Floating Point Numbers (64 bits)
  - 1-bit sign + 11-bit exponent + 52-bit fraction

Normalized Floating Point Numbers

* For a normalized floating point number (S, E, F)
  - Significand is equal to (1.F)₂ = (1.f1 f2 f3 f4 ...)₂
  - IEEE 754 assumes a hidden 1. (not stored) for normalized numbers
  - Significand is 1 bit longer than the fraction
* Value of a Normalized Floating Point Number:
      ± (1.F)₂ x 2^(exponent value)
      ± (1.f1 f2 f3 f4 ...)₂ x 2^(exponent value)
      ± (1 + f1 x 2^-1 + f2 x 2^-2 + f3 x 2^-3 + f4 x 2^-4 ...) x 2^(exponent value)

  S = 0 is positive, S = 1 is negative
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Biased Exponent Representation 


* How to represent a signed exponent? Choices are ...
  - Sign + magnitude representation for the exponent
  - Two's complement representation
  - Biased representation
* IEEE 754 uses biased representation for the exponent
  - Exponent Value = E - Bias (Bias is a constant)
* The exponent field is 8 bits for single precision
  - E can be in the range 0 to 255
  - E = 0 and E = 255 are reserved for special use (discussed later)
  - E = 1 to 254 are used for normalized floating point numbers
  - Bias = 127 (half of 254)
  - Exponent value = E - 127        Range: -126 to +127
* For double precision, the exponent field is 11 bits
  - E can be in the range 0 to 2047
  - E = 0 and E = 2047 are reserved for special use
  - E = 1 to 2046 are used for normalized floating point numbers
  - Bias = 1023 (half of 2046)
  - Exponent value = E - 1023       Range: -1022 to +1023
* Value of a Normalized Floating Point Number is
      ± (1.F)₂ x 2^(E - Bias)
      ± (1.f1 f2 f3 f4 ...)₂ x 2^(E - Bias)
      ± (1 + f1 x 2^-1 + f2 x 2^-2 + f3 x 2^-3 + f4 x 2^-4 ...) x 2^(E - Bias)

  S = 0 is positive, S = 1 is negative
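The value formula for normalized single-precision numbers can be checked directly against the machine's own float conversion. A minimal sketch, assuming a hypothetical helper `decode_single` that takes the 32-bit pattern as an integer and only handles the normalized range (E = 1 to 254):

```python
import struct


def decode_single(bits):
    """Value of a normalized IEEE 754 single given its 32-bit pattern."""
    s = bits >> 31                       # sign bit
    e = (bits >> 23) & 0xFF              # 8-bit biased exponent E
    f = bits & 0x7FFFFF                  # 23-bit fraction F
    if not 1 <= e <= 254:
        raise ValueError("not a normalized number")
    significand = 1 + f / 2**23          # (1.F), the leading 1 is implicit
    return (-1) ** s * significand * 2.0 ** (e - 127)


if __name__ == "__main__":
    pattern = 0b1_01111100_01000000000000000000000
    print(decode_single(pattern))
    # cross-check with the hardware/C float interpretation of the same bits
    print(struct.unpack(">f", struct.pack(">I", pattern))[0])
```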
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Examples of Single Precision Float 


* What is the decimal value of this Single Precision float?
      1 01111100 01000000000000000000000
* Solution:
  - Sign = 1 is negative
  - E = (01111100)₂ = 124, E - bias = 124 - 127 = -3
  - Significand = (1.0100 ... 0)₂ = 1 + 2^-2 = 1.25 (1. is implicit)
  - Value in decimal = -1.25 x 2^-3 = -0.15625
* What is the decimal value of?
      0 10000010 01001100000000000000000
* Solution:
  - Value in decimal = +(1.01001100 ... 0)₂ x 2^(130-127) =
    (1.01001100 ... 0)₂ x 2^3 = (1010.01100 ... 0)₂ = 10.375

Examples of Double Precision Float

* What is the decimal value of this Double Precision float?
      0 10000000101 0010101000 ... 0
* Solution:
  - Value of exponent = (10000000101)₂ - Bias = 1029 - 1023 = 6
  - Value of double = (1.00101010 ... 0)₂ x 2^6 (1. is implicit) =
    (1001010.10 ... 0)₂ = 74.5
* What is the decimal value of? (bit pattern lost in extraction)
  - Do it yourself! (answer should be -1.5 x 2^-7 = -0.01171875)
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Example 
* Represent -0.75 in IEEE 754 FP
      -0.75 = (-1)^1 x 1.1₂ x 2^-1
      S = 1
      Fraction = 1000...00₂
      Exponent = -1 + Bias
          Single: -1 + 127 = 126 = 01111110₂
          Double: -1 + 1023 = 1022 = 01111111110₂
* Single: 1 01111110 1000...00
* Double: 1 01111111110 1000...00

Example

* What number is represented by the single-precision float
      1 10000001 01000...00
      S = 1
      Fraction = 01000...00₂
      Exponent = 10000001₂ = 129
* x = (-1)^1 x (1 + .01₂) x 2^(129 - 127)
    = (-1) x 1.25 x 2^2
    = -5.0

Representing Values

  -12.4375₁₀ = -1100.0111₂

  Short (single):  -1.1000111₂ x 2^3, biased exponent = 3 + 127 = 130
      1 10000010 10001110000 ... 0000
      = 1100 0001 0100 0111 0000 ... 0000₂
      = C1470000h

  Long (double):   -1.1000111₂ x 2^3, biased exponent = 3 + 1023 = 1026
      1 10000000010 10001110000 ... 0000
      = 1100 0000 0010 1000 1110 0000 ... 0000₂
      = C028E00000000000h
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ЕХ 
Write the value -58.25₁₀ using the IEEE 754 32-bit floating-point standard

* First, convert the decimal number to binary:
      58.25₁₀ = 111010.01₂ = 1.1101001 x 2^5
* Next, fill in each field in the 32-bit number:
  - Sign bit: 1 (negative)
  - 8 exponent bits: (127 + 5) = 132₁₀ = 10000100₂
  - 23 fraction bits: 110 1001 0000 0000 0000 0000

  1 bit   8 bits      23 bits
  1       10000100    110 1001 0000 0000 0000 0000
  Sign    Exponent    Fraction

  In hexadecimal: 0xC2690000

Example

* The decimal number -2345.125₁₀ is to be represented in the
  IEEE 754 32-bit single precision format:
      -2345.125₁₀ = -100100101001.001₂            (converted to binary)
                  = -1.00100101001001 x 2^11      (normalized binary)
* The mantissa is negative so the sign S is given by:
      S = 1
* The biased exponent E is given by E = e + 127
      E = 11 + 127 = 138₁₀ = 10001010₂
* Fractional part of mantissa M (the leading 1 is hidden):
      M = .00100101001001000000000    (in 23 bits)
* The IEEE 754 single precision representation is given by:

  1       10001010    00100101001001000000000
  S       E           M
  1 bit   8 bits      23 bits
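The encoding steps (normalize, bias the exponent, keep the fraction bits) can be sketched as follows. The helper name `encode_single` is made up for this example, and it is deliberately limited: it assumes a nonzero value inside the normalized single-precision range and does not handle zero, denormalized numbers, infinity or NaN.

```python
import struct


def encode_single(value):
    """Encode a nonzero, normalized-range number as a 32-bit IEEE 754 pattern."""
    if value == 0:
        raise ValueError("zero needs the special encoding E = 0, F = 0")
    sign = 1 if value < 0 else 0
    mag, exp = abs(value), 0
    while mag >= 2:                      # normalize to 1.xxx x 2^exp
        mag /= 2
        exp += 1
    while mag < 1:
        mag *= 2
        exp -= 1
    fraction = round((mag - 1) * 2**23)  # 23 fraction bits, hidden 1 dropped
    # using + lets a rounding carry out of the fraction bump the exponent
    return (sign << 31) + ((exp + 127) << 23) + fraction


if __name__ == "__main__":
    print(hex(encode_single(-58.25)))
    # cross-check against the platform's own single-precision encoding
    print(hex(struct.unpack(">I", struct.pack(">f", -58.25))[0]))
```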
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Smallest Normalized Float 


* What is the smallest (in absolute value) normalized float?
* Solution for Single Precision:
  - Exponent - bias = 1 - 127 = -126 (smallest exponent for SP)
  - Significand = (1.000 ... 0)₂ = 1
  - Value in decimal = 1 x 2^-126 = 1.17549 ... x 10^-38
* Solution for Double Precision:
  - Value in decimal = 1 x 2^-1022 = 2.22507 ... x 10^-308
* Underflow: exponent is too small to fit in the exponent field

Largest Normalized Float

* What is the largest normalized float?
* Solution for Single Precision:
  - E - bias = 254 - 127 = +127 (largest exponent for SP)
  - Significand = (1.111 ... 1)₂ = 1.99999988 = almost 2
  - Value in decimal ≈ 2 x 2^+127 = 2^+128 = 3.4028 ... x 10^+38
* Solution for Double Precision:
  - Value in decimal ≈ 2 x 2^+1023 = 2^+1024 = 1.79769 ... x 10^+308
* Overflow: exponent is too large to fit in the exponent field
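These limits can be confirmed numerically. The sketch below is only a check, not part of the slides: `bits_to_single` is an assumed helper that reinterprets a 32-bit pattern as a single-precision value, and Python's `sys.float_info` reports the double-precision limits.

```python
import struct
import sys


def bits_to_single(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


smallest_single = bits_to_single(0x00800000)   # E = 1,   F = 0
largest_single = bits_to_single(0x7F7FFFFF)    # E = 254, F = all ones

if __name__ == "__main__":
    print(smallest_single, largest_single)
    print(sys.float_info.min, sys.float_info.max)   # double precision
```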
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Zero, Infinity, and NaN 


* Zero
  - Exponent field E = 0 and fraction F = 0
  - +0 and -0 are both possible according to sign bit S
* Infinity
  - Infinity is a special value represented with maximum E and F = 0
      For single precision with 8-bit exponent: maximum E = 255
      For double precision with 11-bit exponent: maximum E = 2047
  - Infinity can result from overflow or division by zero
  - +∞ and -∞ are both possible according to sign bit S
* NaN (Not a Number)
  - NaN is a special value represented with maximum E and F ≠ 0
  - 0 / 0 => NaN, 0 x ∞ => NaN, sqrt(-1) => NaN
  - Operation on a NaN is typically a NaN: Op(X, NaN) => NaN

Denormalized Numbers

* The IEEE standard uses denormalized numbers to ...
  - Fill the gap between 0 and the smallest normalized float
  - Provide gradual underflow to zero
* Denormalized: exponent field E is 0 and fraction F ≠ 0
  - The implicit 1. before the fraction now becomes 0. (denormalized)
* Value of denormalized number (S, 0, F)
      Single precision: ± (0.F)₂ x 2^-126
      Double precision: ± (0.F)₂ x 2^-1022

[Figure: number line for single precision: Negative Overflow | -∞ ... -2^128 ... -2^-126 | Negative Underflow (Denorm) | 0 | Positive Underflow (Denorm) | 2^-126 ... 2^128 ... +∞ | Positive Overflow]
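The special encodings can be told apart from the E and F fields alone. A small sketch, with the function names `classify_single` and `denormal_value` invented for the example:

```python
def classify_single(bits):
    """Classify a 32-bit IEEE 754 single-precision pattern by its E and F fields."""
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    if e == 0:
        return "zero" if f == 0 else "denormalized"
    if e == 255:
        return "infinity" if f == 0 else "NaN"
    return "normalized"


def denormal_value(bits):
    """Value of a denormalized single: (-1)^S x (0.F) x 2^-126."""
    s = bits >> 31
    f = bits & 0x7FFFFF
    return (-1) ** s * (f / 2**23) * 2.0 ** -126


if __name__ == "__main__":
    for pattern in (0x00000000, 0x00000001, 0x3F800000, 0x7F800000, 0x7FC00000):
        print(hex(pattern), classify_single(pattern))
```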


< 
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Summary of IEEE 754 Encoding 
Single-Precision      | Exponent = 8 | Fraction = 23 | Value
Normalized Number     | 1 to 254     | Anything      | ± (1.F)₂ x 2^(E - 127)
Denormalized Number   | 0            | nonzero       | ± (0.F)₂ x 2^-126
Zero                  | 0            | 0             | ± 0
Infinity              | 255          | 0             | ± ∞
NaN                   | 255          | nonzero       | NaN

Double-Precision      | Exponent = 11 | Fraction = 52 | Value
Normalized Number     | 1 to 2046     | Anything      | ± (1.F)₂ x 2^(E - 1023)
Denormalized Number   | 0             | nonzero       | ± (0.F)₂ x 2^-1022
Zero                  | 0             | 0             | ± 0
Infinity              | 2047          | 0             | ± ∞
NaN                   | 2047          | nonzero       | NaN

Some Example IEEE 754 Single-Precision Floating-Point Numbers

Floating-Point Number | Single-Precision Representation
0                     | 0 00000000 00000000000000000000000
± Infinity            | 0/1 11111111 00000000000000000000000
NaN                   | 0/1 11111111 any nonzero significand

If the real exponent of a number is X then it is represented as (X + bias). IEEE single-precision uses a
bias of 127. Therefore, an exponent of

  -1 is represented as -1 + 127 = 126 = 01111110₂
   0 is represented as  0 + 127 = 127 = 01111111₂
  +1 is represented as +1 + 127 = 128 = 10000000₂
  +5 is represented as +5 + 127 = 132 = 10000100₂
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Flouting Point Addition 
* Consider a 4-digit decimal example
  - 9.999 x 10^1 + 1.610 x 10^-1
1. Align decimal points
  - Shift the number with the smaller exponent
  - 9.999 x 10^1 + 0.016 x 10^1
2. Add significands
  - 9.999 x 10^1 + 0.016 x 10^1 = 10.015 x 10^1
3. Normalize result & check for over/underflow
  - 1.0015 x 10^2
4. Round and renormalize if necessary
  - 1.002 x 10^2

Floating Point Addition

* Now consider a 4-digit binary example
  - 1.000₂ x 2^-1 + -1.110₂ x 2^-2 (0.5 + -0.4375)
1. Align binary points
  - Shift the number with the smaller exponent
  - 1.000₂ x 2^-1 + -0.111₂ x 2^-1
2. Add significands
  - 1.000₂ x 2^-1 + -0.111₂ x 2^-1 = 0.001₂ x 2^-1
3. Normalize result & check for over/underflow
  - 1.000₂ x 2^-4, with no over/underflow
4. Round and renormalize if necessary
  - 1.000₂ x 2^-4 (no change) = 0.0625
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FP Adder 


* Algorithm

  1. Compare the exponents of the two numbers.
     Shift the smaller number to the right until its
     exponent would match the larger exponent.
  2. Add the significands.
  3. Normalize the sum, either shifting right and
     incrementing the exponent or shifting left
     and decrementing the exponent.
  4. Round the significand to the appropriate
     number of bits.
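The four steps can be tried out on a toy format. This is a sketch under explicit assumptions, not the slide's hardware: `fp_add` is an invented name, operands are given as a signed integer significand scaled by 2^frac_bits (so 1.111₂ with 3 fraction bits is 15) plus an exponent, rounding is round-to-nearest with ties to even, and alignment keeps every shifted-out bit by scaling the larger operand up instead of truncating the smaller one. Overflow and underflow checks are left out.

```python
def fp_add(a_sig, a_exp, b_sig, b_exp, frac_bits=3):
    """Add two toy floating-point numbers; return (signed significand, exponent)."""
    # Step 1: compare exponents and align (exact: no bits are discarded here).
    if a_exp < b_exp:
        a_sig, a_exp, b_sig, b_exp = b_sig, b_exp, a_sig, a_exp
    shift = a_exp - b_exp
    # Step 2: add the significands, in units of 2^(b_exp - frac_bits).
    total = (a_sig << shift) + b_sig
    if total == 0:
        return 0, 0
    sign = -1 if total < 0 else 1
    mag = abs(total)
    # Step 3: normalize so that exactly one 1 bit sits before the binary point.
    top = mag.bit_length() - 1
    exp = top + b_exp - frac_bits
    drop = top - frac_bits                   # extra low-order bits to round away
    # Step 4: round to frac_bits fraction bits (nearest, ties to even).
    if drop > 0:
        q, rem = mag >> drop, mag & ((1 << drop) - 1)
        half = 1 << (drop - 1)
        if rem > half or (rem == half and q & 1):
            q += 1
    else:
        q = mag << -drop
    if q == 1 << (frac_bits + 1):            # rounding carried out: renormalize
        q >>= 1
        exp += 1
    return sign * q, exp


if __name__ == "__main__":
    print(fp_add(0b1000, -1, -0b1110, -2))   # 0.5 + -0.4375
```

A returned pair (q, e) stands for q / 2^frac_bits x 2^e, so (8, -4) with 3 fraction bits means 1.000₂ x 2^-4.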
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ЕР Adder Hardware 
* Much more complex than an integer adder
* Doing it in one clock cycle would take too long
  - Much longer than integer operations
  - A slower clock would penalize all instructions
* FP adder usually takes several cycles
  - Can be pipelined

[Figure: FP adder datapath, by stage: compare exponents (exponent difference), shift smaller number right, add (significand adder), normalize (increment or decrement exponent; shift left or right), round (rounding hardware)]


Example
* Consider adding: (1.111)₂ x 2^-1 + (1.011)₂ x 2^-3
  - For simplicity, we assume 4 bits of precision (or 3 bits of fraction)
* Cannot add significands ... Why?
  - Because exponents are not equal
* How to make exponents equal?
  - Shift the significand of the lesser exponent right
    until its exponent matches the larger number
  - (1.011)₂ x 2^-3 = (0.1011)₂ x 2^-2 = (0.01011)₂ x 2^-1
  - Difference between the two exponents = -1 - (-3) = 2
  - So, shift right by 2 bits
* Now, add the significands:

          1.111
      +   0.01011
      -----------
         10.00111     (carry out of the integer position)

* So, (1.111)₂ x 2^-1 + (1.011)₂ x 2^-3 = (10.00111)₂ x 2^-1
* However, the result (10.00111)₂ x 2^-1 is NOT normalized
* Normalize result: (10.00111)₂ x 2^-1 = (1.000111)₂ x 2^0
  - In this example, we have a carry
  - So, shift right by 1 bit and increment the exponent
* Round the significand to fit in the appropriate number of bits
  - We assumed 4 bits of precision or 3 bits of fraction
  - Round to nearest: (1.000111)₂ ≈ (1.001)₂
  - Renormalize if rounding generates a carry
* Detect overflow / underflow
  - If the exponent becomes too large (overflow) or too small (underflow)
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Example
* Consider: (1.000)₂ × 2^-3 - (1.000)₂ × 2^2
  - We assume again: 4 bits of precision (or 3 bits of fraction)
* Shift significand of the lesser exponent right
  - Difference between the two exponents = 2 - (-3) = 5
  - Shift right by 5 bits: (1.000)₂ × 2^-3 = (0.00001000)₂ × 2^2
* Convert subtraction into addition to 2's complement (the leading bit is the sign)

      0 0.00001 × 2^2
    + 1 1.00000 × 2^2   (2's complement of 1.00000)
    = 1 1.00001 × 2^2

  - Since the result is negative, convert the result from 2's complement to sign-magnitude: the 2's complement of 1 1.00001 is 0 0.11111
  - So, (1.000)₂ × 2^-3 - (1.000)₂ × 2^2 = -0.11111₂ × 2^2
* Normalize result: -0.11111₂ × 2^2 = -1.1111₂ × 2^1
  - For subtraction, we can have leading zeros
  - Count number z of leading zeros (in this case z = 1)
  - Shift left and decrement exponent by z
* Round the significand to fit in appropriate number of bits
  - We assumed 4 bits of precision or 3 bits of fraction
  - Round to nearest: (1.1111)₂ ≈ (10.000)₂
  - Renormalize: rounding generated a carry
    -1.1111₂ × 2^1 ≈ -10.000₂ × 2^1 = -1.000₂ × 2^2
  - The result would have been exact if more fraction bits were used
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Example

* Consider adding single-precision floats:

      1.11100100000000000000010₂ × 2^4
    + 1.10000000000000110000101₂ × 2^2

* Cannot add significands ... Why?
  - Because exponents are not equal
* How to make exponents equal?
  - Shift the significand of the lesser exponent right
  - Difference between the two exponents = 4 - 2 = 2
  - So, shift the second number right by 2 bits and increment its exponent by 2

      1.10000000000000110000101₂ × 2^2
    = 0.01100000000000001100001 01₂ × 2^4

* Now, add the significands:

       1.11100100000000000000010    × 2^4
    +  0.01100000000000001100001 01 × 2^4   (shift right)
    = 10.01000100000000001100011 01 × 2^4   (result)

* Addition produces a carry bit, so the result is NOT normalized
* Normalize result (shift right and increment exponent):

      10.01000100000000001100011 01 × 2^4
    =  1.00100010000000000110001 101 × 2^5   (normalized)
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Rounding

* Single-precision requires only 23 fraction bits
* However, the normalized result can contain additional bits

      1.00100010000000000110001 | 1 01 × 2^5
      Round bit: R = 1, Sticky bit: S = 1

* Two extra bits are used for rounding
  - Round bit: appears just after the normalized result
  - Sticky bit: appears after the round bit (OR of all additional bits)
* Since RS = 11, increment fraction to round to nearest

      1.00100010000000000110001 × 2^5
    +                        1
    = 1.00100010000000000110010 × 2^5   (rounded)


Rounding to Nearest Even

* Normalized result has the form: 1.f1 f2 ... fl R S
  - The round bit R appears immediately after the last fraction bit fl
  - The sticky bit S is the OR of all remaining additional bits
* Round to Nearest Even: default rounding mode
* Four cases for RS:
  - RS = 00 → Result is exact, no need for rounding
  - RS = 01 → Truncate result by discarding RS
  - RS = 11 → Increment result: add 1 to last fraction bit
  - RS = 10 → Tie case (either truncate or increment result)
    · Check last fraction bit fl (f23 for single-precision or f52 for double)
    · If fl is 0 then truncate result to keep fraction even
    · If fl is 1 then increment result to make fraction even
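The four RS cases map directly onto a small function. The sketch below is illustrative only: the name `round_nearest_even` and its argument layout (kept fraction bits and discarded bits passed as separate integers) are assumptions for this example, and a carry out of the fraction (which would require renormalization) is left to the caller.

```python
def round_nearest_even(fraction, extra_bits, n_extra):
    """Round a fraction using the round bit R and sticky bit S.

    fraction   : integer holding the fraction bits that are kept
    extra_bits : integer holding the n_extra bits that follow the fraction
    """
    if n_extra == 0:
        return fraction
    r = (extra_bits >> (n_extra - 1)) & 1                        # round bit
    s = 1 if extra_bits & ((1 << (n_extra - 1)) - 1) else 0      # sticky bit (OR)
    if r == 0:              # RS = 00 (exact) or RS = 01: truncate
        return fraction
    if s == 1:              # RS = 11: increment
        return fraction + 1
    return fraction + (fraction & 1)   # RS = 10: tie, make the last bit even


# The single-precision example above: extra bits 101 give R = 1, S = 1.
rounded = round_nearest_even(0b00100010000000000110001, 0b101, 3)
print(format(rounded, "023b"))
```

For the worked example this prints `00100010000000000110010`, matching the rounded fraction on the slide.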
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Floating-Point Multiplication

* Consider a 4-digit decimal example
  - 1.110 × 10^10 × 9.200 × 10^-5
1. Add exponents
  - For biased exponents, subtract bias from sum
  - New exponent = 10 + -5 = 5
2. Multiply significands
  - 1.110 × 9.200 = 10.212 ⇒ 10.212 × 10^5
3. Normalize result & check for over/underflow
  - 1.0212 × 10^6
4. Round and renormalize if necessary
  - 1.021 × 10^6
5. Determine sign of result from signs of operands
  - +1.021 × 10^6

* Now consider a 4-digit binary example
  - 1.000₂ × 2^-1 × -1.110₂ × 2^-2 (0.5 × -0.4375)
1. Add exponents
  - Unbiased: -1 + -2 = -3
  - Biased: (-1 + 127) + (-2 + 127) = -3 + 254 - 127 = -3 + 127
2. Multiply significands
  - 1.000₂ × 1.110₂ = 1.110₂ ⇒ 1.110₂ × 2^-3
3. Normalize result & check for over/underflow
  - 1.110₂ × 2^-3 (no change) with no over/underflow
4. Round and renormalize if necessary
  - 1.110₂ × 2^-3 (no change)
5. Determine sign: +ve × -ve ⇒ -ve
  - -1.110₂ × 2^-3 = -0.21875
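The five multiplication steps can be modeled the same way. This is a rough sketch under stated assumptions: operands are hypothetical `(sign, significand, exponent)` tuples with the significand held as an unsigned integer scaled by `2**frac_bits`, exponents are unbiased, rounding is to nearest even, and overflow/underflow detection is left out.

```python
def fp_mul(a, b, frac_bits=3):
    """Multiply two (sign, sig, exp) values, each meaning (-1)**sign * sig * 2**(exp - frac_bits)."""
    (sign_a, sig_a, exp_a), (sign_b, sig_b, exp_b) = a, b
    sign = sign_a ^ sign_b                 # step 5: sign from the operand signs
    exp = exp_a + exp_b                    # step 1: add (unbiased) exponents
    prod = sig_a * sig_b                   # step 2: product has 2*frac_bits fraction bits
    if prod == 0:
        return sign, 0, 0
    width = frac_bits + 1
    # Step 3: normalize. A product of two normalized significands lies in [1, 4),
    # so the exponent grows by one when the product reaches 2 or more.
    exp += prod.bit_length() - (2 * frac_bits + 1)
    extra = prod.bit_length() - width
    if extra > 0:
        half = 1 << (extra - 1)
        rem = prod & ((1 << extra) - 1)
        prod >>= extra
        # Step 4: round to nearest even and renormalize on a rounding carry.
        if rem > half or (rem == half and prod & 1):
            prod += 1
            if prod.bit_length() > width:
                prod >>= 1
                exp += 1
    elif extra < 0:                        # only reached for unnormalized inputs
        prod <<= -extra
    return sign, prod, exp


# The binary example above: 1.000 x 2^-1 times -1.110 x 2^-2
print(fp_mul((0, 8, -1), (1, 14, -2)))
```

The result `(1, 14, -3)` stands for -1.110₂ × 2^-3 = -0.21875.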
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FP Arithmetic Hardware

* FP multiplier is of similar complexity to FP adder
  - But uses a multiplier for significands instead of an adder
* FP arithmetic hardware usually does
  - Addition, subtraction, multiplication, division, reciprocal, square-root
  - FP <-> integer conversion
* Operations usually take several cycles
  - Can be pipelined

FP Instructions in MIPS

* FP hardware is coprocessor 1
  - Adjunct processor that extends the ISA
* Separate FP registers
  - 32 single-precision: $f0, $f1, ... $f31
  - Paired for double-precision: $f0/$f1, $f2/$f3, ...
* FP instructions operate only on FP registers
  - Programs generally don't do integer ops on FP data, or vice versa
* FP load and store instructions
  - lwc1, ldc1, swc1, sdc1
  - e.g., ldc1 $f8, 32($sp)
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FP Instructions in MIPS

* Single-precision arithmetic
  - add.s, sub.s, mul.s, div.s
  - e.g., add.s $f0, $f1, $f6
* Double-precision arithmetic
  - add.d, sub.d, mul.d, div.d
  - e.g., mul.d $f4, $f4, $f6
* Single- and double-precision comparison
  - c.xx.s, c.xx.d (xx is eq, lt, le, ...)
  - Sets or clears the FP condition-code bit
  - e.g., c.lt.s $f3, $f4
* Branch on FP condition code true or false
  - bc1t, bc1f
  - e.g., bc1t TargetLabel

FP Example: °F to °C

* C code:

    float f2c (float fahr) {
      return ((5.0/9.0)*(fahr - 32.0));
    }

  - fahr in $f12, result in $f0, literals in global memory space
* Compiled MIPS code:

    f2c: lwc1  $f16, const5($gp)
         lwc1  $f18, const9($gp)
         div.s $f16, $f16, $f18
         lwc1  $f18, const32($gp)
         sub.s $f18, $f12, $f18
         mul.s $f0,  $f16, $f18
         jr    $ra
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Chapter Four
The Processor (Datapath and Control)

The Five Classic Components of a Computer

* Processor (CPU): the active part of the computer that does all the work (data manipulation and decision-making)
* Datapath: consists of the functional units of the processor (see the next figure)
  - Elements that hold data: program counter, register file, instruction memory, etc.
  - Elements that operate on data: ALU, adders, etc.
  - Buses for transferring data between elements
* Control unit: responsible for setting all the control signals so that each instruction is executed properly
  - The control unit's input is the 32-bit instruction word
  - The outputs are values for the blue control signals in the datapath, as shown in the figure below
  - Most of the signals can be generated from the instruction opcode alone, and not the entire 32-bit word
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Datapath and Control

* Datapath: based on the data transfers required to perform instructions
* Controller: causes the right transfers to happen

[Figure: MIPS datapath with control. Components: PC, instruction memory (Address → Instruction), register file (two register numbers read, one written), ALU (with Zero output), data memory (Address, Data). The Control block takes opcode and funct and drives Branch, RegWrite, MemRead, and MemWrite.]
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CPU performance factors
* Instruction count
  - Determined by ISA and compiler
* CPI and cycle time
  - Determined by CPU hardware

Performance = 1 / Execution time, simplified to 1 / CPU execution time

CPU execution time = Instruction count × CPI / Clock rate
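As a quick numeric illustration of the execution-time formula, the sketch below evaluates it for made-up values. The function name `cpu_time` and the workload numbers (one million instructions, CPI of 2, 1 GHz clock) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU execution time in seconds = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical workload: 1,000,000 instructions, CPI = 2, 1 GHz clock.
seconds = cpu_time(1_000_000, 2.0, 1e9)
print(f"{seconds * 1e3:.1f} ms")   # 2,000,000 cycles at 1 ns each
```

Halving the CPI or doubling the clock rate would each halve the execution time, which is why both the hardware (CPI, cycle time) and the ISA/compiler (instruction count) matter.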





CPU Clocking
* For each instruction, how do we control the flow of information through the datapath?
* Single-cycle CPU: all stages of an instruction are completed within one long clock cycle
  - The clock cycle is sufficiently long to allow each instruction to complete all stages without interruption within one cycle

  [Figure: one clock cycle spanning 1. Instruction Fetch, 2. Decode / Register Read, 3. Execute, 4. Memory, 5. Register Write]

* Alternative multiple-cycle CPU: only one stage of an instruction per clock cycle
  - The clock is made as long as the slowest stage

  [Figure: one clock cycle for each of 1. Instruction Fetch, 2. Decode / Register Read, 3. Execute, 4. Memory, 5. Register Write]

  - Several significant advantages over single-cycle execution: unused stages in a particular instruction can be skipped, or instructions can be pipelined (overlapped)
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Designing a Processor: Step-by-Step
* Analyze instruction set => datapath requirements
  - The meaning of each instruction is given by the register transfers
  - Datapath must include storage elements for ISA registers
  - Datapath must support each register transfer
* Select datapath components and clocking methodology
* Assemble datapath meeting the requirements
* Analyze implementation of each instruction
  - Determine the setting of control signals for register transfer
* Assemble the control logic


Review of MIPS Instruction Formats

* All instructions are 32 bits wide
* Three instruction formats: R-type, I-type, and J-type

    R-type: Op6 | Rs5 | Rt5 | Rd5 | sa5 | funct6
    I-type: Op6 | Rs5 | Rt5 | immediate16
    J-type: Op6 | immediate26

  - Op6: 6-bit opcode of the instruction
  - Rs5, Rt5, Rd5: 5-bit source and destination register numbers
  - sa5: 5-bit shift amount used by shift instructions
  - funct6: 6-bit function field for R-type instructions
  - immediate16: 16-bit immediate value or address offset
  - immediate26: 26-bit target address of the jump instruction
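Because every field sits at a fixed bit position, decoding is just shifting and masking. The following sketch extracts all fields of the three formats at once; the function name `decode` and the dictionary keys are illustrative choices, and which fields are meaningful depends on the instruction's format.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into its fields."""
    return {
        "op":    (word >> 26) & 0x3F,     # bits 31..26
        "rs":    (word >> 21) & 0x1F,     # bits 25..21
        "rt":    (word >> 16) & 0x1F,     # bits 20..16
        "rd":    (word >> 11) & 0x1F,     # bits 15..11 (R-type)
        "sa":    (word >> 6) & 0x1F,      # bits 10..6  (R-type shifts)
        "funct": word & 0x3F,             # bits 5..0   (R-type)
        "imm16": word & 0xFFFF,           # bits 15..0  (I-type)
        "imm26": word & 0x3FFFFFF,        # bits 25..0  (J-type)
    }


# add $8, $9, $10 encodes as 0x012A4020 (op = 0, funct = 0x20)
fields = decode(0x012A4020)
print(fields["op"], fields["rs"], fields["rt"], fields["rd"], hex(fields["funct"]))
```

In hardware no work is needed for this step: the fields are simply different wires of the instruction bus.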
Dr. Ahmed Jaber Spring 2019 
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MIPS Subset of Instructions

* Only a subset of the MIPS instructions is considered
  - ALU instructions (R-type): add, sub, and, or, xor, slt
  - Immediate instructions (I-type): addi, slti, andi, ori, xori
  - Load and Store (I-type): lw, sw
  - Branch (I-type): beq, bne
  - Jump (J-type): j
* This subset does not include all the integer instructions
* But it is sufficient to illustrate the design of datapath and control
* Concepts used to implement the MIPS subset are used to construct a broad spectrum of computers


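Details of the MIPS Subset

Instruction        | Meaning          | Format
add  rd, rs, rt    | addition         | op6 = 0 | rs5 | rt5 | rd5 | 0 | 0x20
sub  rd, rs, rt    | subtraction      | op6 = 0 | rs5 | rt5 | rd5 | 0 | 0x22
and  rd, rs, rt    | bitwise and      | op6 = 0 | rs5 | rt5 | rd5 | 0 | 0x24
or   rd, rs, rt    | bitwise or       | op6 = 0 | rs5 | rt5 | rd5 | 0 | 0x25
xor  rd, rs, rt    | exclusive or     | op6 = 0 | rs5 | rt5 | rd5 | 0 | 0x26
slt  rd, rs, rt    | set on less than | op6 = 0 | rs5 | rt5 | rd5 | 0 | 0x2a
addi rt, rs, im16  | add immediate    | 0x08 | rs5 | rt5 | im16
slti rt, rs, im16  | slt immediate    | 0x0a | rs5 | rt5 | im16
andi rt, rs, im16  | and immediate    | 0x0c | rs5 | rt5 | im16
ori  rt, rs, im16  | or immediate     | 0x0d | rs5 | rt5 | im16
xori rt, rs, im16  | xor immediate    | 0x0e | rs5 | rt5 | im16
lw   rt, im16(rs)  | load word        | 0x23 | rs5 | rt5 | im16
sw   rt, im16(rs)  | store word       | 0x2b | rs5 | rt5 | im16
beq  rs, rt, im16  | branch if equal  | 0x04 | rs5 | rt5 | im16
bne  rs, rt, im16  | branch not equal | 0x05 | rs5 | rt5 | im16
j    im26          | jump             | 0x02 | im26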
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* Use the program counter (PC) to read the instruction address
* Fetch the instruction from memory and increment PC
* Use fields of the instruction to select registers to read
* Execute depending on instruction class
  - Use ALU to calculate
    · Arithmetic result
    · Memory address for load/store
    · Branch target address
  - Access data memory for load/store
* PC ← target address or PC + 4

[Figure: the five execution stages. IF: instruction fetch; ID: instruction decode / register file read; EX: execute / address calculation; MEM: memory access; WB: write back]

Stage             | Action
Instruction Fetch | Send an address to the instruction memory; read the instruction (IMEM[PC])
Decode            | Generate the control signal values using the opcode & funct fields; read the register values with the relevant fields and generate the immediate
Execute           | Perform arithmetic / logical operations and branch comparison
Memory            | Read from / write to the data memory (DMEM)
Write Back        | Write the ALU result / the memory load / PC + 4 to the register file
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Requirements of the Instruction Set

* Memory
  - Instruction memory where instructions are stored
  - Data memory where data is stored
* Registers
  - 32 × 32-bit general purpose registers, R0 is always zero
  - Read source register Rs
  - Read source register Rt
  - Write destination register Rt or Rd
* Program counter PC register and Adder to increment PC
* Sign and Zero extender for immediate constant
* ALU for executing instructions


Register Transfer Level (RTL)

* RTL is a description of data flow between registers
* RTL gives a meaning to the instructions
* All instructions are fetched from memory at address PC

Instruction | RTL Description
ADD         | Reg(Rd) ← Reg(Rs) + Reg(Rt); PC ← PC + 4
SUB         | Reg(Rd) ← Reg(Rs) - Reg(Rt); PC ← PC + 4
ORI         | Reg(Rt) ← Reg(Rs) | zero_ext(Im16); PC ← PC + 4
LW          | Reg(Rt) ← MEM[Reg(Rs) + sign_ext(Im16)]; PC ← PC + 4
SW          | MEM[Reg(Rs) + sign_ext(Im16)] ← Reg(Rt); PC ← PC + 4
BEQ         | if (Reg(Rs) == Reg(Rt)) PC ← PC + 4 + 4 × sign_ext(Im16) else PC ← PC + 4
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Instructions are Executed in Steps

* R-type
  - Fetch instruction: Instruction ← MEM[PC]
  - Fetch operands: data1 ← Reg(Rs), data2 ← Reg(Rt)
  - Execute operation: ALU_result ← func(data1, data2)
  - Write ALU result: Reg(Rd) ← ALU_result
  - Next PC address: PC ← PC + 4
* I-type
  - Fetch instruction: Instruction ← MEM[PC]
  - Fetch operands: data1 ← Reg(Rs), data2 ← Extend(imm16)
  - Execute operation: ALU_result ← op(data1, data2)
  - Write ALU result: Reg(Rt) ← ALU_result
  - Next PC address: PC ← PC + 4
* BEQ
  - Fetch instruction: Instruction ← MEM[PC]
  - Fetch operands: data1 ← Reg(Rs), data2 ← Reg(Rt)
  - Equality: zero ← subtract(data1, data2)
  - Branch: if (zero) PC ← PC + 4 + 4 × sign_ext(imm16) else PC ← PC + 4
* LW
  - Fetch instruction: Instruction ← MEM[PC]
  - Fetch base register: base ← Reg(Rs)
  - Calculate address: address ← base + sign_extend(imm16)
  - Read memory: data ← MEM[address]
  - Write register Rt: Reg(Rt) ← data
  - Next PC address: PC ← PC + 4
* SW
  - Fetch instruction: Instruction ← MEM[PC]
  - Fetch registers: base ← Reg(Rs), data ← Reg(Rt)
  - Calculate address: address ← base + sign_extend(imm16)
  - Write memory: MEM[address] ← data
  - Next PC address: PC ← PC + 4
* Jump
  - Fetch instruction: Instruction ← MEM[PC]
  - Target PC address: target ← PC[31:28] || address26 || '00'
  - Jump: PC ← target
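The register transfers listed above can be tried out with a tiny instruction-level simulator. This is a teaching sketch, not the course's reference model: `step`, `i_type`, and `r_type` are hypothetical helper names, memory is a plain dictionary from byte address to word, only ADD, SUB, ORI, LW, SW, BEQ, and J are handled, and the jump target uses PC[31:28] exactly as written in the step list.

```python
MASK = 0xFFFFFFFF


def sign_ext16(imm16):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def step(pc, regs, mem):
    """Execute the instruction at mem[pc] and return the next PC."""
    instr = mem[pc]                                   # Instruction <- MEM[PC]
    op, funct = instr >> 26, instr & 0x3F
    rs, rt, rd = (instr >> 21) & 31, (instr >> 16) & 31, (instr >> 11) & 31
    imm16, imm26 = instr & 0xFFFF, instr & 0x3FFFFFF
    next_pc = (pc + 4) & MASK                         # default: PC <- PC + 4
    if op == 0x00 and funct == 0x20:                  # ADD
        regs[rd] = (regs[rs] + regs[rt]) & MASK
    elif op == 0x00 and funct == 0x22:                # SUB
        regs[rd] = (regs[rs] - regs[rt]) & MASK
    elif op == 0x0D:                                  # ORI (zero-extended immediate)
        regs[rt] = regs[rs] | imm16
    elif op == 0x23:                                  # LW
        regs[rt] = mem[(regs[rs] + sign_ext16(imm16)) & MASK]
    elif op == 0x2B:                                  # SW
        mem[(regs[rs] + sign_ext16(imm16)) & MASK] = regs[rt]
    elif op == 0x04:                                  # BEQ
        if regs[rs] == regs[rt]:
            next_pc = (pc + 4 + 4 * sign_ext16(imm16)) & MASK
    elif op == 0x02:                                  # J
        next_pc = (pc & 0xF0000000) | (imm26 << 2)
    else:
        raise ValueError(f"unsupported instruction {instr:#010x}")
    regs[0] = 0                                       # register 0 is always zero
    return next_pc


def i_type(op, rs, rt, imm):
    """Assemble an I-type word (helper for the demo only)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def r_type(rs, rt, rd, funct):
    """Assemble an R-type word with op = 0 and sa = 0 (helper for the demo only)."""
    return (rs << 21) | (rt << 16) | (rd << 11) | funct


# Demo: $1 = 5, $2 = 7, $3 = $1 + $2, store and reload $3 at address 0x100.
regs = [0] * 32
mem = {
    0: i_type(0x0D, 0, 1, 5),
    4: i_type(0x0D, 0, 2, 7),
    8: r_type(1, 2, 3, 0x20),
    12: i_type(0x2B, 0, 3, 0x100),
    16: i_type(0x23, 0, 4, 0x100),
}
pc = 0
for _ in range(5):
    pc = step(pc, regs, mem)
print(regs[3], regs[4], pc)
```

Each `elif` branch is one row of the step list; a real single-cycle datapath performs the same transfers with shared hardware selected by control signals.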




MIPS Implementations

Two MIPS implementations will be studied:
* A simplified version
* A more realistic pipelined version

Any instruction set can be implemented in many different ways:

* Single cycle: all "steps" of executing an instruction are done in one clock cycle. The cycle is long to accommodate the longest path.
  - Advantage: one clock cycle per instruction (an entire instruction executes in each cycle)
  - Disadvantage: long cycle time
* Multi-cycle: an instruction takes several steps (cycles) to execute.
  - Break the fetch/execute cycle into multiple steps
  - Perform 1 step in each clock cycle
* Pipelining lets a processor overlap the execution of several instructions, potentially leading to big performance gains.
  - Execute each instruction in multiple steps
  - Perform 1 step / instruction in each clock cycle
  - Process multiple instructions in parallel
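The trade-off between the single-cycle and pipelined styles can be made concrete with a small timing model. The stage latencies below are assumed values chosen for illustration (they are not taken from the slides), and the model ignores hazards, stalls, and pipeline register overhead.

```python
# Assumed stage latencies in picoseconds (illustrative values only).
STAGES_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_time(n_instructions):
    """Every instruction takes one long cycle that covers all stages."""
    cycle = sum(STAGES_PS.values())
    return n_instructions * cycle


def pipelined_time(n_instructions):
    """One stage per cycle; the cycle is set by the slowest stage.

    The first instruction needs len(STAGES_PS) cycles to fill the pipeline,
    then one instruction completes per cycle (no hazards assumed).
    """
    cycle = max(STAGES_PS.values())
    return (n_instructions + len(STAGES_PS) - 1) * cycle


for n in (3, 1000):
    print(n, single_cycle_time(n), pipelined_time(n))
```

With these assumed numbers, three instructions take 2400 ps single-cycle and 1400 ps pipelined; for long instruction streams the ratio approaches the single-cycle time divided by the slowest stage time.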


[Figure: timing diagram, time axis in ps (100 to 1900), comparing single-cycle and pipelined execution of successive instructions. Each instruction passes through Fetch Instruction, Decode / Read Reg, Execute ALU, Memory Read/Write, and Write Reg; in the pipelined version the stages of consecutive instructions overlap.]
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Logic Design Basics

* Information encoded in binary
  - Low voltage = 0, High voltage = 1
  - One wire per bit
  - Multi-bit data encoded on multi-wire buses
* Combinational element
  - Operates on data
  - Output is a function of input
* State (sequential) elements
  - Store information


Review: Two Types of Logic Components

* Combinational circuit: inputs x, outputs z = f(x)
* Sequential circuit: a combinational circuit plus a state element; external inputs and the present state (internal inputs) produce external outputs and the next state (internal outputs)
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Combinational circuits

* Mux, Demux, Decoder, ALU, ...
* ALU operations, chosen by OpSelect:
  - Add, Sub, ...
  - And, Or, Xor, Not, ...
  - GT, LT, EQ, Zero, ...

[Figure: ALU symbol with control input OpSelect and outputs Result and Comp?]





Sequential Elements

(Flip-flop, Register, Register file, SRAM, DRAM)

* Registers are implemented with arrays of D flip-flops
* Registers contain (store) data
* A register uses a clock signal to determine when to update the stored value
  - Edge-triggered: update when CLK changes from 0 to 1
* All state elements together define the state of the machine

[Figure: D flip-flop with inputs D and Clk and output Q; a register with write enable En; timing of Clk, D, En, and Q]
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Clocking methodology: the approach used to determine when data is valid and stable relative to the clock.

Clocking Methodology

* Clocks are needed in sequential logic to decide when a state element (register) should be updated
* To ensure correctness, a clocking methodology defines when data can be written and read
* We assume edge-triggered clocking
  - All state changes occur on the same clock edge
  - Data must be valid and stable before the arrival of the clock edge
  - Edge-triggered clocking allows a register to be read and written during the same clock cycle

[Figure: clock waveform with rising edge and falling edge marked]

Determining the Clock Cycle

* With edge-triggered clocking, the clock cycle must be long enough to accommodate the path from one register through the combinational logic to another register
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* Tclk-q: clock-to-output delay through the register
* Tmax_comb: longest delay through the combinational logic
* Ts: setup time, during which the input to a register must be stable before the arrival of the clock edge
* Th: hold time, during which the input to a register must hold after the arrival of the clock edge
* Hold time (Th) is normally satisfied since Tclk-q > Th
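Putting the timing terms together gives the usual register-to-register constraint: the cycle must cover the clock-to-output delay, the longest combinational path, and the setup time, i.e. Tcycle ≥ Tclk-q + Tmax_comb + Ts. The sketch below evaluates that bound; the function names and the delay values (in picoseconds) are hypothetical.

```python
def min_cycle_time(t_clk_q, t_max_comb, t_setup):
    """Smallest clock period that lets data launched at one edge be captured at the next."""
    return t_clk_q + t_max_comb + t_setup


def hold_satisfied(t_clk_q, t_hold):
    """The slide's hold check: the register output changes only after the hold window."""
    return t_clk_q > t_hold


# Hypothetical delays in picoseconds.
period_ps = min_cycle_time(t_clk_q=50, t_max_comb=700, t_setup=30)
print(period_ps, "ps ->", round(1e6 / period_ps, 1), "MHz max clock")
print(hold_satisfied(t_clk_q=50, t_hold=20))
```

In a single-cycle processor Tmax_comb is the delay of the slowest instruction's complete path, which is why that design has a long cycle time.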
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Single-cycle Implementation
* First step: building a datapath
* Next step: implementing control

* Use a single long clock cycle for every instruction.
* This approach is generally slower than a multi-cycle implementation, where different instruction classes can take different numbers of cycles.
  - In a single-cycle implementation, every instruction must take the same amount of time as the slowest instruction.
  - In a multi-cycle implementation this problem is avoided by allowing quicker instructions to use fewer cycles.

MIPS makes it easier
* Instructions are all the same size
* Source registers are always in the same place
* Immediates have the same size and location
* Operations are always on registers/immediates
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Single-cycle datapath => CPI = 1, clock cycle time (CCT) => long

Single Cycle Datapath and Control

[Figure: single-cycle datapath with control. Next PC logic (inputs J, Beq, Bne; output PCSrc and the 32-bit jump or branch target address), PC, instruction memory, and ALU result. Main Control decodes Op and generates RegDst, RegWrite, ExtOp, ALUSrc, MemRead, MemWrite, MemtoReg, and ALUOp; ALUCtrl is derived from ALUOp and func.]


Note

The single-cycle datapath conceptually described in this section must have separate instruction and data memories because (choose the correct reason):

1. The format of data and instructions is different in MIPS and hence different memories are needed.
2. Having separate memories is less expensive.
3. The processor operates in one cycle and cannot use a single-ported memory for two different accesses within that cycle.

Statement 3 is the correct reason: an instruction fetch and a data access can both occur in the same clock cycle.
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Datapath consists of the functional units of the processor.
* Elements that hold data: program counter, register file, instruction memory, etc.
* Elements that operate on data: ALU, adders, etc.
* Buses for transferring data between elements.

Components of the Datapath

* Combinational elements
  - ALU, Adder
  - Immediate extender
  - Multiplexers
* Storage elements
  - Instruction memory
  - Data memory (controlled by MemRead and MemWrite)
  - PC register
  - Register file
* Clocking methodology


MIPS Register File

* Register File consists of 32 × 32-bit registers
  - BusA and BusB: 32-bit output busses for reading 2 registers
  - BusW: 32-bit input bus for writing a register when RegWrite is 1
  - Two registers read and one written in a cycle
* Registers are selected by:
  - RA selects register to be read on BusA
  - RB selects register to be read on BusB
  - RW selects the register to be written
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[Figure: register file internals. Inputs ReadSel1, ReadSel2, WriteSel, WriteData, write enable, and Clock; outputs ReadData1 and ReadData2; storage is register 0, register 1, ..., register 31.]

* No timing issues in reading a selected register
* Register files with a large number of ports are difficult to design
* A read can be done at any time (i.e., it is combinational)
* A write is performed at the rising clock edge if it is enabled
  - The write address and data must be stable at the clock edge
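A behavioral model makes the read/write asymmetry explicit: reads are plain lookups, while a write only takes effect when the clock-edge method is called with the enable set. The class and method names below are illustrative, not part of the slides.

```python
class RegisterFile:
    """Behavioral model of a 32 x 32-bit register file with two read ports and one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, ra, rb):
        """Combinational read: returns (BusA, BusB) for the selected registers."""
        return self.regs[ra], self.regs[rb]

    def clock_edge(self, rw, bus_w, reg_write):
        """Rising clock edge: write BusW into register RW if RegWrite is 1."""
        if reg_write and rw != 0:          # register 0 is hard-wired to zero
            self.regs[rw] = bus_w & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(rw=5, bus_w=42, reg_write=1)
rf.clock_edge(rw=6, bus_w=7, reg_write=0)   # RegWrite = 0: nothing is written
print(rf.read(5, 6))
```

Because the write happens only at the edge, an instruction can read a register and write the same register in one cycle: the read sees the old value for the whole cycle.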
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[Figure: register file ports. Read register number 1 → Read data 1; Read register number 2 → Read data 2; write port with Write enable, Register number, and Register data; storage is Register 0, Register 1, ..., Register n-2, Register n-1.]
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Next, we have the program counter, or PC.

The PC is a state element that holds the address of the current instruction. Essentially, it is just a 32-bit register which holds the instruction address and is updated at the end of every clock cycle.

* Normally the PC increments sequentially, except for branch and jump instructions

The arrows on either side indicate that the PC state element is both readable and writeable.





Instruction and Data Memories

* Instruction memory needs only provide read access
  - Because the datapath does not write instructions
  - Behaves as combinational logic for read
  - Address selects Instruction after access time
* Data memory is used for load and store
  - MemRead: enables output on Data_out
    · Address selects the word to put on Data_out
  - MemWrite: enables writing of Data_in
    · Address selects the memory word to be written
    · The Clock synchronizes the write operation
* Separate instruction and data memories
  - Later, we will replace them with caches

[Figure: Instruction Memory (Address → Instruction) and Data Memory (Address, Data_in → Data_out, with MemRead, MemWrite, and clk)]
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Building a Multifunction ALU

[Figure: multifunction ALU with 32-bit inputs A and B, built from a shifter, an adder, and a logic unit; a final multiplexer (ALU selection) chooses the ALU result.]

Control          | Encoding
Shift operation  | None = 00, SLL = 01, SRL = 10, SRA = 11 (uses Shift Amount)
Arithmetic op    | ADD = 0, SUB = 1
Logic operation  | AND = 00, OR = 01, NOR = 10, XOR = 11
ALU selection    | Shift = 00, SLT = 01, Arith = 10, Logic = 11

* SLT: the ALU does a SUB and checks the sign and overflow
* Sign = Result31 (bit 31 of the result)
* Outputs: ALU Result (32 bits), zero, carry-out, overflow
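The encodings in the ALU figure can be exercised with a small behavioral model. This is a sketch under assumptions: the function name `alu` and its argument order are invented for the example, shifts are assumed to operate on input B by `shamt`, the carry-out output is not modeled, and the SLT result is computed as sign XOR overflow of A - B.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a signed two's-complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x


def alu(a, b, sel, op, shamt=0):
    """Return (result, zero, overflow).

    sel: 0 = Shift, 1 = SLT, 2 = Arith, 3 = Logic (the ALU selection encoding).
    op : shift op (0 none, 1 SLL, 2 SRL, 3 SRA), arith op (0 ADD, 1 SUB),
         or logic op (0 AND, 1 OR, 2 NOR, 3 XOR), depending on sel.
    """
    overflow = False
    if sel == 0:                                   # shifter
        if op == 0:
            result = b
        elif op == 1:
            result = (b << shamt) & MASK
        elif op == 2:
            result = b >> shamt
        else:
            result = (to_signed(b) >> shamt) & MASK
    elif sel in (1, 2):                            # adder, also used by SLT
        subtract = sel == 1 or op == 1
        full = to_signed(a) - to_signed(b) if subtract else to_signed(a) + to_signed(b)
        overflow = not (-(1 << 31) <= full < (1 << 31))
        result = full & MASK
        if sel == 1:                               # SLT: sign XOR overflow
            result = (result >> 31) ^ int(overflow)
            overflow = False
    else:                                          # logic unit
        result = [a & b, a | b, ~(a | b) & MASK, a ^ b][op]
    return result, result == 0, overflow


print(alu(5, 7, 2, 0))              # ADD
print(alu(5, 7, 1, 0))              # SLT: 5 < 7
print(alu(0, 0xF0000000, 0, 3, 4))  # SRA by 4
```

Checking overflow as well as the sign is what keeps SLT correct when A - B does not fit in 32 bits, for example when comparing the most negative number with 1.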
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Details of the Extender

* Two types of extensions
  - Zero-extension for unsigned constants
  - Sign-extension for signed constants
* Control signal ExtOp indicates the type of extension
* Extender implementation: wiring and one AND gate
  - ExtOp = 0 ⇒ Upper16 = 0
  - ExtOp = 1 ⇒ Upper16 = sign bit
  - The lower 16 bits are the immediate itself

[Figure: extender. ExtOp AND the sign bit of Imm16 drives the upper 16 bits; the lower 16 bits pass through.]
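The "wiring and one AND gate" description translates almost literally into code. In this sketch the function name `extend` is an assumption; the single AND of ExtOp with the immediate's sign bit produces the value that is replicated across the upper 16 bits.

```python
def extend(imm16, ext_op):
    """Extend a 16-bit immediate to 32 bits: zero-extend if ext_op = 0, sign-extend if ext_op = 1."""
    sign_bit = (imm16 >> 15) & 1
    upper_bit = ext_op & sign_bit            # the single AND gate
    upper16 = 0xFFFF if upper_bit else 0     # replicated on all 16 upper wires
    return (upper16 << 16) | (imm16 & 0xFFFF)


print(hex(extend(0x8001, 1)))   # sign-extended
print(hex(extend(0x8001, 0)))   # zero-extended
```

Logical immediates such as `ori` use ExtOp = 0, while arithmetic immediates and load/store offsets use ExtOp = 1.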





So now we have the instruction memory, PC, and adder datapath elements. Now we can talk about the general steps taken to execute a program.

* Instruction fetching: use the address in the PC to fetch the current instruction from instruction memory.
* Instruction decoding: determine the fields within the instruction.
* Instruction execution: perform the operation indicated by the instruction.
* Update the PC to hold the address of the next instruction.
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Instruction Fetching Datapath

* We can now assemble the datapath from its components
* For instruction fetching, we need ...
  - Program Counter (PC) register
  - Instruction Memory
  - Adder for incrementing PC

[Figure: instruction fetch datapath. The address from the PC is sent to the instruction memory to read the instruction at IM[PC]; the instruction read from the memory is sent to the rest of the datapath; the PC is incremented by four after reading the current instruction, since an instruction is 32 bits (4 bytes) long.]

Improved Datapath
* Increments the upper 30 bits of the PC by 1
* The least significant 2 bits of the PC are 00 since the PC is a multiple of 4
* This datapath does not handle branch or jump instructions
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Datapath for R-type Instructions

[Figure: R-type datapath: PC address, instruction memory, register file, and ALU producing the ALU result]

* RA & RB come from the instruction's Rs & Rt fields
* RW comes from the Rd field
* ALU inputs come from BusA & BusB
* ALU result is connected to BusW
* Control signals
  - ALUCtrl is derived from the funct field because Op = 0 for R-type
  - RegWrite is used to enable the writing of the ALU result


Datapath for I-type ALU Instructions

[Figure: I-type ALU datapath: PC address, instruction memory, register file, extender, and ALU producing the ALU result]

* RW now comes from Rt, instead of Rd
* Second ALU input comes from the extended immediate
* Control signals
  - ALUCtrl is derived from the Op field
  - RegWrite is used to enable the writing of the ALU result
  - ExtOp is used to control the extension of the 16-bit immediate
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Combining R-type & I-type Datapaths

[Figure: combined R-type and I-type datapath]

* A mux selects RW as either Rt or Rd
* Another mux selects the 2nd ALU input as either the source register Rt data on BusB or the extended immediate
* Control signals
  - ALUCtrl is derived from either the Op or the funct field
  - RegWrite enables the writing of the ALU result
  - ExtOp controls the extension of the 16-bit immediate
  - RegDst selects the register destination as either Rt or Rd
  - ALUSrc selects the 2nd ALU source as BusB or extended immediate


Controlling ALU Instructions

* For R-type ALU instructions, RegDst is '1' to select Rd on RW and ALUSrc is '0' to select BusB as the second ALU input. The active part of the datapath is shown in green.
* For I-type ALU instructions, RegDst is '0' to select Rt on RW and ALUSrc is '1' to select the extended immediate as the second ALU input. The active part of the datapath is shown in green.

[Figure: two copies of the combined datapath, one with the R-type path active (ALUSrc = 0) and one with the I-type path active]





Adding Data Memory to Datapath

* A data memory is added for load and store instructions

[Figure: datapath with data memory. Control signals shown: RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, MemtoReg.]

* ALU calculates the data memory address
* A 3rd mux selects data on BusW as either the ALU result or memory data out
* BusB is connected to Data_in of Data Memory for store instructions
* Additional control signals
  - MemRead for load instructions
  - MemWrite for store instructions
  - MemtoReg selects data on BusW as ALU result or Memory Data_out


Controlling the Execution of Load

[Figure: datapath with the load path active: ExtOp = 1, ALUSrc = 1, ALUCtrl = ADD, MemRead = 1, MemWrite = 0, MemtoReg = 1]

* RegDst = '0' selects Rt as destination register
* RegWrite = '1' to enable writing of register file
* ExtOp = 1 to sign-extend Immediate16 to 32 bits
* ALUSrc = '1' selects extended immediate as second ALU input
* ALUCtrl = 'ADD' to calculate data memory address as Reg(Rs) + sign-extend(Imm16)
* MemRead = '1' to read data memory
* MemtoReg = '1' places the data read from memory on BusW
* Clock edge updates PC and Register Rt
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sw rt, im16(rs) | store word | 0x2b | rs5 | rt5 | im16

Controlling the Execution of Store

* RegDst = 'x' because no register is written
* RegWrite = '0' to disable writing of register file
* ExtOp = 1 to sign-extend Immediate16 to 32 bits
* ALUSrc = '1' selects extended immediate as second ALU input
* ALUCtrl = 'ADD' to calculate data memory address as Reg(Rs) + sign-extend(Imm16)
* MemWrite = '1' to write data memory
* MemtoReg = 'x' because we don't care what data is put on BusW
* Clock edge updates PC and Data Memory
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Adding Jump and Branch to Datapath

[Figure: datapath extended with Next PC logic, which supplies the 32-bit jump or branch target address to the PC]

* Additional control signals
  - J, Beq, Bne for jump and branch instructions
  - Zero flag of the ALU is examined
  - PCSrc = 1 for jump & taken branch
* Next PC logic computes the jump or branch target instruction address





Details of Next PC

[Figure: Next PC logic with inputs Beq, Bne, J, and Zero, producing the branch or jump target address]

* Sign-extension: the most-significant bit is replicated
* Branch target: Imm16 is shifted left by 2 bits to give an 18-bit value, which is then sign-extended to 32 bits
* Jump target address: the upper 4 bits of the PC are concatenated with Imm26 shifted left by 2 bits (4 PC bits followed by 28 bits)
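The Next PC rules can be collected into one function. This is a behavioral sketch with assumed names (`next_pc` and its argument order are not from the slides); it takes PCSrc = J or (Beq and Zero) or (Bne and not Zero), adds the branch offset to PC + 4, and takes the upper four jump-target bits from the incremented PC.

```python
MASK = 0xFFFFFFFF


def next_pc(pc, imm16, imm26, j, beq, bne, zero):
    """Return the address of the next instruction for the given control inputs."""
    incremented = (pc + 4) & MASK
    if j:
        # Upper 4 bits of the incremented PC, then Imm26 shifted left by 2 (28 bits).
        target = (incremented & 0xF0000000) | ((imm26 & 0x3FFFFFF) << 2)
    else:
        offset = (imm16 & 0xFFFF) << 2        # 18-bit value
        if offset & 0x20000:                  # sign-extend from bit 17
            offset -= 0x40000
        target = (incremented + offset) & MASK
    pc_src = j or (beq and zero) or (bne and not zero)
    return target if pc_src else incremented


print(hex(next_pc(0x100, 3, 0, j=0, beq=1, bne=0, zero=1)))   # taken beq
print(hex(next_pc(0x100, 3, 0, j=0, beq=1, bne=0, zero=0)))   # not taken
```

A branch that is not taken, and every non-branch instruction, simply continues at PC + 4.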
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DATAPATH FOR J-FORMAT

Here, we have modified the datapath to work only for the J instruction.

[Figure: Instruction [25-0] (26 bits) is shifted left by 2 to form Jump Address [27-0] (28 bits), which is concatenated with PC [31-28] to form Jump Address [31-0], the target address loaded into the PC.]





Controlling the Execution of Jump

[Figure: datapath with the jump path active, supplying the 32-bit jump target address to the PC]

* J = 1 selects Imm26 as jump target address
* Upper 4 bits are from the incremented PC
* PCSrc = 1 to select jump target address
* MemRead, MemWrite & RegWrite are 0
* We don't care about RegDst, ExtOp, ALUSrc, ALUCtrl, and MemtoReg
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DATAPATH FOR BRANCH INSTRUCTIONS 


- Read register operands
- Compare operands
  - Use ALU, subtract and check Zero output
- Calculate target address
  - Sign-extend displacement
  - Shift left 2 places (word displacement)
  - Add to PC + 4
    - Already calculated by instruction fetch

[Figure: branch datapath; PC + 4 from the instruction datapath and the sign-extended, shifted displacement feed an adder whose sum is the branch target, while the ALU Zero output goes to the branch control logic]


Controlling the Execution of Branch 


[Figure: single-cycle datapath annotated with the control values for a branch; the Next PC logic outputs the 32-bit branch target address]

- Either Beq or Bne = 1
- Next PC outputs the branch target address
- ALUSrc = '0' (2nd ALU input is BusB)
- ALUCtrl = 'SUB' produces the zero flag
- Next PC logic determines PCSrc according to the zero flag
- MemRead = MemWrite = RegWrite = 0
- RegDst = ExtOp = MemtoReg = X
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SINGLE-CYCLE CONTROL 


Now we have a complete datapath for our simple MIPS subset. We will add 
the control. 


The control unit is responsible for taking the instruction and generating the 
appropriate signals for the datapath elements. 


Signals that need to be generated include:- 

* Operation to be performed by ALU. 

* Whether register file needs to be written. 

* Signals for multiple intermediate multiplexors. 

* Whether data memory needs to be written. 

For the most part, we can generate these signals using only the opcode and 
funct fields of an instruction. 


Single-Cycle Datapath + Control 


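[Figure: single-cycle datapath with control; the opcode (Op) drives the main control, which generates RegDst, RegWrite, ExtOp, ALUSrc, ALUOp, Beq, Bne, MemRead, MemWrite, and MemtoReg; the funct field and ALUOp drive the ALU control, which generates ALUCtrl; the Next PC logic outputs the 32-bit jump or branch target address]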
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Main Control апа ALU Control 


[Figure: the instruction memory feeds the Main Control and ALU Control blocks, which drive the datapath]

Main Control
- Input: 6-bit opcode field from instruction
- Output:
  - 10 control signals for datapath
  - ALUOp for ALU Control

ALU Control
- Input:
  - 6-bit function field from instruction
  - ALUOp from main control
- Output: ALUCtrl signal for ALU


The Main Control Unit 


О Control signals derived from the instruction 


Format     | 31-26  | 25-21 | 20-16 | 15-11 | 10-6  | 5-0
R-type     | 0      | rs    | rt    | rd    | shamt | funct

Format     | 31-26  | 25-21 | 20-16 | 15-0
Load/store | opcode | rs    | rt    | address
Branch     | opcode | rs    | rt    | address

Field annotations: rs is always read; rt is read for R-type and branch; the destination is written for R-type and load; the address field is sign-extended and added.

- The destination register for a load instruction is in bits 20-16 (rt), while for R-type it is in bits 15-11 (rd), so a multiplexor is required to select between them.
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Main Control Signals 


Signal   | Effect when '0' | Effect when '1'
RegDst   | Destination register = Rt | Destination register = Rd
RegWrite | No register is written | Destination register is written with the data on BusW
ExtOp    | 16-bit immediate is zero-extended | 16-bit immediate is sign-extended
ALUSrc   | Second ALU operand comes from the second register file output (BusB) | Second ALU operand comes from the extended 16-bit immediate
MemRead  | Data memory is not read | Data memory is read: Data out ← Memory[address]
MemWrite | Data memory is not written | Data memory is written: Memory[address] ← Data in
MemtoReg | BusW = ALU result | BusW = Data out from Memory
Beq, Bne | PC ← PC + 4 | PC ← Branch target address, if branch is taken
J        | PC ← PC + 4 | PC ← Jump target address
ALUOp    | This multi-bit signal specifies the ALU operation as a function of the opcode





Main Control Signal Values 


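Op     | RegDst | RegWrite | ExtOp  | ALUSrc | ALUOp  | Beq | Bne | J | MemRead | MemWrite | MemtoReg
R-type | 1=Rd   | 1        | x      | 0=BusB | R-type | 0   | 0   | 0 | 0       | 0        | 0
addi   | 0=Rt   | 1        | 1=sign | 1=Imm  | ADD    | 0   | 0   | 0 | 0       | 0        | 0
slti   | 0=Rt   | 1        | 1=sign | 1=Imm  | SLT    | 0   | 0   | 0 | 0       | 0        | 0
andi   | 0=Rt   | 1        | 0=zero | 1=Imm  | AND    | 0   | 0   | 0 | 0       | 0        | 0
ori    | 0=Rt   | 1        | 0=zero | 1=Imm  | OR     | 0   | 0   | 0 | 0       | 0        | 0
xori   | 0=Rt   | 1        | 0=zero | 1=Imm  | XOR    | 0   | 0   | 0 | 0       | 0        | 0
lw     | 0=Rt   | 1        | 1=sign | 1=Imm  | ADD    | 0   | 0   | 0 | 1       | 0        | 1
sw     | x      | 0        | 1=sign | 1=Imm  | ADD    | 0   | 0   | 0 | 0       | 1        | x
beq    | x      | 0        | x      | 0=BusB | SUB    | 1   | 0   | 0 | 0       | 0        | x
bne    | x      | 0        | x      | 0=BusB | SUB    | 0   | 1   | 0 | 0       | 0        | x
j      | x      | 0        | x      | x      | x      | 0   | 0   | 1 | 0       | 0        | x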
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Logic Equations for Control Signals 


RegDst <= R-type
RegWrite <= (sw + beq + bne + j)'
ExtOp <= (andi + ori + xori)'
ALUSrc <= (R-type + beq + bne)'
MemRead <= lw
MemWrite <= sw
MemtoReg <= lw
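
A minimal sketch of how such logic equations could be evaluated in software. The right-hand sides used here are an assumption: they are derived from the control-signal descriptions for jump, branch, load, and store given earlier in the slides (only the terms `(R-type + beq + bne)` and `lw` are legible in the extracted equations), and don't-care outputs simply take whatever value the equation yields. The function name `main_control` is hypothetical.

```python
def main_control(op):
    """Return the main control signals for one instruction name.

    `op` is one of: "R-type", "addi", "slti", "andi", "ori", "xori",
    "lw", "sw", "beq", "bne", "j". Each entry mirrors one assumed logic
    equation; a prime (complement) becomes `not in`.
    """
    return {
        "RegDst":   int(op == "R-type"),
        "RegWrite": int(op not in ("sw", "beq", "bne", "j")),
        "ExtOp":    int(op not in ("andi", "ori", "xori")),
        "ALUSrc":   int(op not in ("R-type", "beq", "bne")),
        "MemRead":  int(op == "lw"),
        "MemWrite": int(op == "sw"),
        "MemtoReg": int(op == "lw"),
        "Beq":      int(op == "beq"),
        "Bne":      int(op == "bne"),
        "J":        int(op == "j"),
    }
```

Calling `main_control("sw")`, for instance, asserts MemWrite and ALUSrc while leaving RegWrite at 0, which matches the store description earlier in the deck.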
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R-type 


[Figure: a decoder turns the 6-bit opcode into one select line per instruction (R-type, addi, slti, andi, ori, xori, lw, sw, ...); the logic equations combine these lines, for example the term (R-type + beq + bne), to produce the control signals]
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ALU Control 
ALU used for
- Load/Store: Function = add
- Branch: Function = subtract
- R-type: Function depends on funct field

ALU control | Function
0000 | AND
0001 | OR
0010 | add
0110 | subtract
0111 | set-on-less-than

- Assume 2-bit ALUOp derived from opcode
- Combinational logic derives ALU control

Instruction opcode | ALUOp | Instruction operation | Funct field | Desired ALU action | ALU control input
lw           | 00 | load word        | XXXXXX | add              | 0010
sw           | 00 | store word       | XXXXXX | add              | 0010
Branch equal | 01 | branch equal     | XXXXXX | subtract         | 0110
R-type       | 10 | add              | 100000 | add              | 0010
R-type       | 10 | subtract         | 100010 | subtract         | 0110
R-type       | 10 | AND              | 100100 | AND              | 0000
R-type       | 10 | OR               | 100101 | OR               | 0001
R-type       | 10 | set on less than | 101010 | set on less than | 0111
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ALUOp Funct field 





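ALUOp1 ALUOp0 | F5 F4 F3 F2 F1 F0 | Operation
0      0      | X  X  X  X  X  X  | 0010
X      1      | X  X  X  X  X  X  | 0110
1      X      | X  X  0  0  0  0  | 0010
1      X      | X  X  0  0  1  0  | 0110
1      X      | X  X  0  1  0  0  | 0000
1      X      | X  X  0  1  0  1  | 0001
1      X      | X  X  1  0  1  0  | 0111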


Logic Equation for ALUctr2

[Truth table: ALUOp bit<1>, bit<0> and func bit<5> ... bit<0> for the rows where ALUctr<2> = 1]

- This makes func<3> a don't care

ALUctr2 =


Logic Equation for ALUctr1

[Truth table: ALUOp bit<1>, bit<0> and func bit<5> ... bit<0> for the rows where ALUctr<1> = 1]

[Truth table: ALUOp bit<1>, bit<0> and func bit<5> ... bit<0> for the rows where ALUctr<0> = 1]
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Тһе ALU control block generates the four ALU control bits, based on the 
function code and ALUOp bits 
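
As a rough illustration of what this block computes, the following sketch maps a 2-bit ALUOp and the 6-bit funct field to a 4-bit ALU control value. It assumes the encoding used in the ALU control slides (0010 add, 0110 subtract, 0000 AND, 0001 OR, 0111 set-on-less-than) and the standard MIPS funct code 100101 for OR; the function name `alu_control` is hypothetical.

```python
# funct[3:0] -> ALU control bits for R-type instructions (ALUOp = 1X).
# The upper two funct bits are don't-cares in this encoding.
R_TYPE_OPERATION = {
    0b0000: 0b0010,  # add
    0b0010: 0b0110,  # subtract
    0b0100: 0b0000,  # AND
    0b0101: 0b0001,  # OR
    0b1010: 0b0111,  # set on less than
}

def alu_control(alu_op, funct):
    """Combinational ALU control: returns the 4-bit operation code."""
    if alu_op == 0b00:          # load/store: always add
        return 0b0010
    if alu_op & 0b01:           # branch (ALUOp = X1): always subtract
        return 0b0110
    return R_TYPE_OPERATION[funct & 0b1111]   # R-type: decode funct
```

A table lookup is used for the R-type case because it mirrors the truth table row by row; a hardware implementation would instead reduce the same table to a few gates.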


[Figure: ALU control block; inputs are ALUOp1, ALUOp0, and the function field F (5-0); outputs are Operation3, Operation2, Operation1, and Operation0]





Main Control 


Setting of the control signals 


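Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-type      | 1      | 0      | 0        | 1        | 0       | 0        | 0      | 1      | 0
lw          | 0      | 1      | 1        | 1        | 1       | 0        | 0      | 0      | 0
sw          | X      | 1      | X        | 0        | 0       | 1        | 0      | 0      | 0
beq         | X      | 0      | X        | 0        | 0       | 0        | 1      | 0      | 1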
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ШШ k ее ШЕ. „ЖЕ ШЕЖЕ 
[Figure: main control block; outputs are RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, and ALUOp0]

Control Unit PLA Implementation 





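[Figure: PLA implementation of the control unit; inputs are Op5, Op4, Op3, Op2, Op1, Op0; the AND plane forms product terms such as R-format; outputs are RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, and ALUOp0]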
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Drawbacks of Single Cycle Processor 


- Long cycle time
  - All instructions take as much time as the slowest

[Figure: per-instruction delay bars. ALU: instruction fetch, then the remaining ALU steps. Load: instruction fetch through memory read (longest delay). Store: instruction fetch through memory write. Branch: instruction fetch, then the branch steps. Jump: instruction fetch, then the jump steps.]

- Alternative solution: multicycle implementation
  - Break down instruction execution into multiple cycles
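Single-Cycle vs. Multicycle MicroMIPS

[Figure: clock timing comparison. In the single-cycle design every instruction (Instr 1, Instr 2, ...) is allotted the same long clock period. In the multicycle design each instruction (Instr 1 to Instr 4) is allotted only as many short clock cycles as it needs, for example 3 cycles or 5 cycles.]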
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Worst Case Timing (Load Instruction) 


[Timing diagram: events within one clock cycle for a load, in order]

1. Clk edge, then PC Clk-to-Q: old PC value becomes the new PC value
2. Instruction memory access time: old instruction becomes new instruction = (Op, Rs, Rt, Rd, Funct, Imm16, Imm26)
3. Delay through control logic: old control signal values become new control signal values (ExtOp, ALUSrc, ALUOp, ...)
4. Register file access time: old BusA value becomes new BusA value = Register(Rs)
5. Delay through extender and ALU mux: old second ALU input becomes new second ALU input = sign-extend(Imm16)
6. ALU delay: old ALU result becomes new ALU result = Address
7. Data memory access time: old data memory output value becomes the new value
8. Mux delay + setup time + clock skew, after which the register write occurs at the end of the clock cycle


- Long cycle time: must be long enough for the Load operation
    PC's Clk-to-Q
  + Instruction Memory's Access Time
  + Maximum of (
      Register File's Access Time,
      Delay through control logic + extender + ALU mux)
  + ALU to Perform a 32-bit Add
  + Data Memory Access Time
  + Delay through MemtoReg Mux
  + Setup Time for Register File Write + Clock Skew
- Cycle time is longer than needed for other instructions
  - Therefore, single cycle processor design is not used in practice
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Summary 


- 5 steps to design a processor
  - Analyze instruction set => datapath requirements
  - Select datapath components & establish clocking methodology
  - Assemble datapath meeting the requirements
  - Analyze implementation of each instruction to determine control signals
  - Assemble the control logic

- MIPS makes Control easier
  - Instructions are of same size
  - Source registers always in same place
  - Immediates are of same size and same location
  - Operations are always on registers/immediates

- Single cycle datapath => CPI=1, but Long Clock Cycle


Single-cycle Design Problems
- Assuming a fixed-period clock, every instruction datapath uses one clock cycle, which implies:
  - CPI = 1
  - Clock period is determined by the length of the longest instruction path (critical path: load instruction)
    - Instruction memory → register file → ALU → data memory → register file
- But several instructions could run in a shorter clock cycle: waste of time
- Consider what happens if we have more complicated instructions like floating point!
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Single-Cycle Performance Example 


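Element              | Parameter       | Delay (ps)
Register clock-to-Q  | $t_{pcq\_PC}$   | 30
Multiplexer          | $t_{mux}$       | 25
ALU                  | $t_{ALU}$       | 200
Memory read          | $t_{mem}$       | 250
Register file read   | $t_{RFread}$    | 150
Register file setup  | $t_{RFsetup}$   | 20

$T_c = t_{pcq\_PC} + 2t_{mem} + t_{RFread} + t_{ALU} + t_{mux} + t_{RFsetup}$
$= [30 + 2(250) + 150 + 200 + 25 + 20]\ \text{ps}$
$= 925\ \text{ps}$

- For a program with 100 billion instructions executing on a single-cycle MIPS processor,
  - Execution Time
    $= \text{Num. of instructions} \times \text{CPI} \times T_c$
    $= (100 \times 10^{9})(1)(925 \times 10^{-12}\ \text{s})$
    $= 92.5\ \text{seconds}$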
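
The arithmetic in this example can be checked with a short script. The dictionary keys below are assumed labels for the six delay terms in the cycle-time sum (clock-to-Q, memory read counted twice, register file read, ALU, multiplexer, register file setup); the numeric values are the ones given in the example.

```python
# Delay of each element on the critical path, in picoseconds.
delays_ps = {
    "pc_clock_to_q": 30,
    "memory_read": 250,      # used twice: instruction fetch and data read
    "regfile_read": 150,
    "alu": 200,
    "mux": 25,
    "regfile_setup": 20,
}

# Cycle time: every element once, plus memory a second time.
tc_ps = sum(delays_ps.values()) + delays_ps["memory_read"]

instructions = 100e9   # 100 billion instructions
cpi = 1                # single-cycle processor
execution_time_s = instructions * cpi * tc_ps * 1e-12

print(f"Tc = {tc_ps} ps, execution time = {execution_time_s:.1f} s")
```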
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Fixed-period clock vs. variable-period clock 


- Example
  - Consider a machine with an additional floating point unit. Assume the functional unit delays as follows:
    - Memory: 2 ns, ALU: 2 ns, FPU add: 8 ns, FPU multiply: 16 ns, register file access (read or write): 1 ns
    - Multiplexors, control unit, PC accesses, sign extension: no delay
  - Assume the instruction mix as follows:
    - Loads: 31%, stores: 21%, R-type: 27%, branches: 5%, jumps: 2%, FP add/sub: 7%, FP mul/div: 7%
  - Compare the performance of a single-cycle implementation using:
    - a fixed-period clock
    - a variable-period clock where each instruction executes in one clock cycle that is only as long as it needs to be

- Solution

Instruction class | Instr. mem. | Reg. read | ALU oper. | Data mem. | Reg. write | FPU add/sub | FPU mul/div | Total time (ns)
Load word         | 2           | 1         | 2         | 2         | 1          |             |             | 8
Store word        | 2           | 1         | 2         | 2         |            |             |             | 7
R-format          | 2           | 1         | 2         | 0         | 1          |             |             | 6
Branch            | 2           | 1         | 2         |           |            |             |             | 5
Jump              | 2           |           |           |           |            |             |             | 2
FP add/sub        | 2           | 1         |           |           | 1          | 8           |             | 12
FP mul/div        | 2           | 1         |           |           | 1          |             | 16          | 20

- Clock period for fixed-period clock = longest instruction time = 20 ns.

- Average clock period for variable-period clock =
  8 x 31% + 7 x 21% + 6 x 27% + 5 x 5% + 2 x 2% + 20 x 7% + 12 x 7% = 8.1 ns.

- Therefore, performance(variable-period) / performance(fixed-period) = 20 / 8.1 ≈ 2.5
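
The weighted sum is easy to get wrong by hand, so here is a small script that evaluates it directly from the unit delays and instruction mix listed in this example. It assumes the 7% floating-point share applies separately to FP add/sub and FP mul/div, as the two 7% terms in the weighted sum indicate; the dictionary layout and names are illustrative only. Evaluated this way, the average comes out at about 8.1 ns and the ratio at about 2.5.

```python
# Functional unit delays in nanoseconds (from the example).
UNIT_NS = {"mem": 2, "reg": 1, "alu": 2, "fp_add": 8, "fp_mul": 16}

# Units used by each instruction class, and its share of the mix.
CLASSES = {
    "load":   (("mem", "reg", "alu", "mem", "reg"), 0.31),
    "store":  (("mem", "reg", "alu", "mem"), 0.21),
    "r_type": (("mem", "reg", "alu", "reg"), 0.27),
    "branch": (("mem", "reg", "alu"), 0.05),
    "jump":   (("mem",), 0.02),
    "fp_add": (("mem", "reg", "fp_add", "reg"), 0.07),
    "fp_mul": (("mem", "reg", "fp_mul", "reg"), 0.07),
}

# Total time per class = sum of the delays of the units it uses.
time_ns = {name: sum(UNIT_NS[u] for u in units)
           for name, (units, _) in CLASSES.items()}

fixed_period = max(time_ns.values())                       # slowest class
average_period = sum(time_ns[name] * share
                     for name, (_, share) in CLASSES.items())
ratio = fixed_period / average_period

print(time_ns)
print(f"fixed = {fixed_period} ns, average = {average_period:.2f} ns, "
      f"ratio = {ratio:.2f}")
```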
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Multicycle Implementation 


Multicycle implementation: also called multiple clock cycle implementation. An implementation in which an instruction is executed in multiple clock cycles.

- Single-cycle microarchitecture:
  + simple
  - cycle time limited by longest instruction (lw)
  - two adders/ALUs and two memories

- Multi-cycle microarchitecture:
  + higher clock speed
  + simpler instructions run faster
  + reuse expensive hardware on multiple cycles

- Same design steps: datapath & control

Merging Logic from Single Cycle to MultiCycle
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What Do We Want То Optimize 
- Single Cycle Architecture uses two memories
  - One memory stores instructions, the other data
  - We want to use a single memory (smaller size)

- Single Cycle Architecture needs three adders
  - ALU, PC, branch address calculation
  - We want to use the ALU for all operations (smaller size)

- In Single Cycle Architecture all instructions take one cycle
  - The most complex operation slows down everything!
  - Divide all instructions into multiple steps
  - Simpler instructions can take fewer cycles (average case may be faster)


Multicycle Execution - Key Idea
- Break instruction execution into multiple cycles
- One clock cycle for each task
  1. Instruction Fetch
  2. Instruction Decode and Register Fetch
  3. Execution, memory address computation, or branch/jump completion
  4. Memory access / R-type instruction completion
  5. Memory read completion
- Share hardware to simplify datapath
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Characteristics of Multicycle Design 


- Instructions take more than one cycle
  - Some instructions take more cycles than others
- Clock cycle is shorter than single-cycle clock
- Reuse of major components simplifies datapath
  - Single ALU for all calculations
  - Single memory for instructions and data
  - But added registers are needed to store values across cycles
- Control unit implemented by a finite state machine
  - Control signals are no longer a function of just the instruction

[Figure: high-level multicycle datapath with a single memory for instructions or data, an instruction register, the register file, and the ALU]




The multicycle implementation allows a functional unit to be used more than once per instruction, as long as it is used on different clock cycles. This sharing can help reduce the amount of hardware required. The ability to allow instructions to take different numbers of clock cycles and the ability to share functional units within the execution of a single instruction are the major advantages of a multicycle design.

The use of shared functional units requires the addition or widening of multiplexors as well as new temporary registers that hold data between clock cycles of the same instruction. The additional registers are the Instruction register (IR), the Memory data register (MDR), A, B, and ALUOut.

The IR needs to hold the instruction until the end of execution of that instruction, and thus will require a write control signal. All the registers except the IR hold data only between a pair of adjacent clock cycles and will thus not need a write control signal.


- Between steps/cycles
  - At the end of one cycle, store data to be used in later cycles of the same instruction
    - need to introduce additional internal (programmer-invisible) registers for this purpose
  - Data to be used in later instructions are stored in programmer-visible state elements: the register file, PC, memory
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Multicycle Datapath for MIPS Handles the Basic Instructions 


Handling the additional inputs requires two changes to the datapath:
1. An additional multiplexor is added for the first ALU input. The multiplexor chooses between the A register and the PC.
2. The multiplexor on the second ALU input is changed from a two-way to a four-way multiplexor. The two additional inputs to the multiplexor are the constant 4 (used to increment the PC) and the sign-extended and shifted offset field (used in the branch address computation).

By introducing a few registers and multiplexors, we are able to reduce the number of memory units from two to one and eliminate two adders. Since registers and multiplexors are fairly small compared to a memory unit or ALU, this could yield a substantial reduction in the hardware cost.





























[Figure: multicycle datapath for the basic instructions, showing the memory, instruction register, memory data register, register file, sign extend and shift-left-2 units, and ALU, with the control signals IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSrcA, MemtoReg, ALUSrcB, and ALUOp]
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The complete datapath for the multicycle implementation together 
with the necessary control lines. 









[Figure: complete multicycle datapath with its control unit. The control unit takes Instruction [31-26] as input; its outputs shown include PCSource, PCWriteCond, PCWrite, ALUSrcB, MemRead, MemWrite, MemtoReg, and IRWrite. The ALU control takes Instruction [5-0]. The datapath includes the memory, instruction register, register file, shift-left-2 unit, and ALUOut.]
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Actions of the 1-bit control signals 
Signal name | Effect when deasserted | Effect when asserted
RegDst      | The register file destination number for the Write register comes from the rt field. | The register file destination number for the Write register comes from the rd field.
RegWrite    | None. | The general-purpose register selected by the Write register number is written with the value of the Write data input.
ALUSrcA     | The first ALU operand is the PC. | The first ALU operand comes from the A register.
MemRead     | None. | Content of memory at the location specified by the Address input is put on Memory data output.
MemWrite    | None. | Memory contents at the location specified by the Address input is replaced by value on Write data input.
MemtoReg    | The value fed to the register file Write data input comes from ALUOut. | The value fed to the register file Write data input comes from the MDR.
IorD        | The PC is used to supply the address to the memory unit. | ALUOut is used to supply the address to the memory unit.
IRWrite     | None. | The output of the memory is written into the IR.
PCWrite     | None. | The PC is written; the source is controlled by PCSource.
PCWriteCond | None. | The PC is written if the Zero output from the ALU is also active.





Actions of the 2-bit control signals 
Signal name | Value (binary) | Effect
ALUOp       | 00             | The ALU performs an add operation.
PCSource    | 10             | The jump target address (IR[25:0] shifted left 2 bits and concatenated with PC + 4[31:28]) is sent to the PC for writing.
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Five Stages of Instruction Execution 


1. Instruction Fetch and PC increment
2. Instruction Decode and Register Fetch (and branch target calculation)
3. One of the following:
   - Execute R-type instruction, OR
   - Calculate memory address for load/store, OR
   - Perform comparison for branch, OR
   - Jump completion
4. Memory access for load/store OR R-type instruction completion (save result)
5. Memory read completion (save result - load only)


The Five Cycles of MIPS 





(Instruction fetch)
    IR := Memory[PC]
    PC := PC + 4
(Instruction decode and register fetch)
    A := Reg[IR[25:21]], B := Reg[IR[20:16]]
    ALUout := PC + (sign-extend(IR[15:0]) << 2)
(Execute | Memory address | Branch completion)
    Memory reference: ALUout := A + sign-extend(IR[15:0])
    R-type (ALU): ALUout := A op B
    Branch: if A = B then PC := ALUout
(Memory access | R-type completion)
    LW: MDR := Memory[ALUout]
    SW: Memory[ALUout] := B
    R-type: Reg[IR[15:11]] := ALUout
(Writeback)
    LW: Reg[IR[20:16]] := MDR
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Notes:- 
- Not all instructions require all the steps. 
- Each step takes one clock cycle. 
- Each MIPS instruction takes from 3 - 5 cycles (steps). 


Step | Step name | R-type | Memory reference | Branches | Jumps
(1) | Instruction fetch | IR = Memory[PC]; PC = PC + 4 (all instruction classes)
(2) | Instruction decode/register fetch | A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2) (all instruction classes)
(3) | Execution, address computation, branch/jump completion | ALUOut = A op B | ALUOut = A + sign-extend(IR[15-0]) | if (A == B) then PC = ALUOut | PC = PC[31-28] || (IR[25-0] << 2)
(4) | Memory access or R-type completion | Reg[IR[15-11]] = ALUOut | Load: MDR = Memory[ALUOut] or Store: Memory[ALUOut] = B | |
(5) | Memory read completion | | Load: Reg[IR[20-16]] = MDR | |
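
To make the step table concrete, here is a small Python sketch that walks one already-decoded instruction through the steps and reports how many cycles it used. It is an illustration under simplifying assumptions, not the hardware: the instruction is passed in as a dictionary rather than a 32-bit word, `imm` is assumed to be already sign-extended, memory is a dictionary keyed by byte address, and the names `execute`, `regs`, and `mem` are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def execute(instr, regs, mem, pc):
    """Run one decoded instruction; return (new_pc, cycles_used)."""
    op = instr["op"]

    # Step 1: instruction fetch (IR is `instr` here) and PC increment.
    pc = (pc + 4) & MASK32

    # Step 2: decode / register fetch, plus speculative branch target.
    a = regs[instr.get("rs", 0)]
    b = regs[instr.get("rt", 0)]
    alu_out = (pc + (instr.get("imm", 0) << 2)) & MASK32

    # Step 3: execution, address computation, branch/jump completion.
    if op == "j":
        return (pc & 0xF0000000) | (instr["target"] << 2), 3
    if op == "beq":
        return (alu_out if a == b else pc), 3
    if op == "add":
        alu_out = (a + b) & MASK32
        regs[instr["rd"]] = alu_out          # Step 4: R-type completion
        return pc, 4
    alu_out = (a + instr["imm"]) & MASK32    # lw/sw effective address

    # Step 4: memory access.
    if op == "sw":
        mem[alu_out] = b
        return pc, 4
    if op == "lw":
        mdr = mem[alu_out]
        regs[instr["rt"]] = mdr              # Step 5: memory read completion
        return pc, 5
    raise ValueError(f"unsupported instruction: {op}")
```

A quick usage example: with `regs[1] = 8` and `mem = {12: 99}`, executing `{"op": "lw", "rs": 1, "rt": 2, "imm": 4}` at `pc = 0` loads 99 into register 2 and takes five cycles, while a taken `beq` takes three.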











Why intermediate registers? 


Sometimes we need the output of a functional unit in a later clock cycle during the execution of an instruction.
(Example: The instruction word fetched in stage 1 determines the destination of the register write in stage 5. The ALU result for an address computation in stage 3 is needed as the memory address for lw or sw in stage 4.)

These outputs must be stored in intermediate registers for future use. Otherwise they will be lost by the next clock cycle.
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(Instruction read іп stage 1 is saved іп Instruction register. Registerfile outputs from stage 
2 are saved in registers А and B. The ALU output will be stored in a register ALUout. Any 
data fetched from memory in stage 4 is kept in the Memory data register MDR.) 


STEP 1 


О Instruction Fetch & PC Increment (IF): 

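- Use PC to get the instruction and put it in the instruction register.
- Increment the PC by 4 and put the result back in the PC.

Can be described using RTL (Register-Transfer Language):

    IR = Memory[PC];
    PC = PC + 4;

[Figure: multicycle datapath with the instruction fetch path highlighted; the memory (ADDR, RD, WD, MemRead, MemWrite) is read at the PC address and its output is written into the instruction register]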
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STEP 2 


О Instruction Decode and Register Fetch (ID): 
- Read registers rs and rt in case we need them.
- Compute the branch address in case the instruction is a branch.

- RTL:

    A = Reg[IR[25-21]];
    B = Reg[IR[20-16]];
    ALUOut = PC + (sign-extend(IR[15-0]) << 2);

[Figure: multicycle datapath with the decode and register fetch path highlighted]
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STEP 3 


О Execution, Address Computation or Branch Completion(EX): 
- ALU performs one of four functions depending on instruction type:

1. memory reference: ALUOut = A + sign-extend(IR[15-0]);

[Figure: multicycle datapath with the address computation path highlighted]

2. R-type: ALUOut = A op B;

[Figure: multicycle datapath with the R-type execution path highlighted]
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О Execution, Address Computation or Branch Completion(EX): 
3. branch (instruction completes): if (A == B) PC = ALUOut;

[Figure: multicycle datapath with the branch completion path highlighted]

4. jump (instruction completes): PC = PC[31-28] || (IR[25-0] << 2), that is, PC[31-28] concatenated with (IR[25-0] << 2)

[Figure: multicycle datapath with the jump completion path highlighted]





Dr. Ahmed Jaber Spring 2019 




STEP 4 
- Memory access or R-type Instruction Completion (MEM):
  - Again depending on instruction type:

1. Loads and stores access memory
   - Load: MDR = Memory[ALUOut];

   [Figure: multicycle datapath with the load memory access path highlighted]

   - Store (instruction completes): Memory[ALUOut] = B;

   [Figure: multicycle datapath with the store memory access path highlighted]
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(1 Memory access ог R-type Instruction Completion (МЕМ): 
2. R-type (instruction completes): Reg[IR[15-11]] = ALUOut; (that is, Reg[Rd] = ALUOut)

[Figure: multicycle datapath with the R-type completion path highlighted]


STEP 5
- Memory Read Completion (WB):
  - Load writes back (instruction completes): Reg[IR[20-16]] = MDR;

[Figure: multicycle datapath with the load write-back path highlighted]
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Review: Finite-State Machines 


Inputs 


Digital logic systems can be classified as combinational or sequential. 


Sequential systems contain state stored in memory elements internal to the system. Their 
behavior depends both on the set of inputs supplied and on the contents of the internal 
memory, or state of the system. 


Thus, a sequential system cannot be described with a truth table. Instead, a sequential 
system is described as a finite-state machine (or often just state machine). 


A finite-state machine has a set of states and two functions called the next-state function 
and the output function. 


finite-state machine : A sequential logic function consisting of a set of inputs and outputs, 
a next-state function that maps the current state and the inputs to a new state, and an output 
function that maps the current state and possibly the inputs to a set of asserted outputs. 


next-state function : A combinational function that, given the inputs and the current state, 
determines the next state of a finite-state machine. 


[Figure: state machine block diagram; the inputs and the current state feed the next-state function, whose result is stored in the clocked current-state register; the output function produces the outputs]

A state machine consists of internal storage that contains the state and two combinational functions: the next-state function and the output function. Often, the output function is restricted to take only the current state as its input; this does not change the capability of a sequential machine, but does affect its internals.
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Multicycle Control 


- Single-cycle control used combinational logic
- Multi-cycle control uses a finite-state machine (FSM)
- FSM defines a succession of states, transitions between states (based on inputs), and outputs (based on state)
- First two states are the same for every instruction; the next state depends on the opcode

Multicycle Control FSM 


[Figure: top-level control FSM. Start enters Instruction fetch, followed by Instruction decode/register fetch, which then branches by instruction class to the memory-reference FSM, R-type FSM, branch FSM, or jump FSM.]

Instruction fetch state: MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00

Instruction decode/register fetch state: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00


Complete finite State Machine Control for The Datapath 
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Instruction fetch 
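Instruction decode/register fetch

[Figure: complete control FSM. After instruction fetch and instruction decode/register fetch, the opcode selects one of four paths: memory address computation (ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00), followed by memory access for lw or sw and, for lw, the memory read completion step; execution, followed by R-type completion; branch completion (ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01); or jump completion.]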
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Performance Considerations 


- Break instruction execution into five steps
  1. Instruction fetch
  2. Instruction decode and register read
  3. Execution, memory address calculation, or branch/jump completion
  4. Memory access (SW) or ALU instruction completion
  5. Load instruction completion
- One step = one clock cycle (clock cycle is reduced)
- First 2 steps are the same for all instructions

Instruction | # cycles | Instruction | # cycles
ALU & Store | 4        | Branch      | 3
Load        | 5        | Jump        | 3

Comparing cycle times 


- Suppose ALU latency is 3 ns, register file latency is 2 ns, and memory access (read or write) latency is 3 ns. Ignore the delays in PC, mux, extender, and wires.
- The clock period has to be long enough to allow all of the required work to complete within the cycle.
- In the single-cycle datapath, the "required work" was just the complete execution of any instruction.
  - The longest instruction, lw, requires 13 ns (3 + 2 + 3 + 3 + 2).
  - So the clock cycle time has to be 13 ns, for a 77 MHz clock rate.
- For the multicycle datapath, the "required work" is only a single stage.
  - The longest delay is 3 ns, for both the ALU and the memory.
  - So our cycle time has to be 3 ns, or a clock rate of 333 MHz.
  - The register file needs only 2 ns, but it must wait an extra 1 ns to stay synchronized with the other functional units.
- The single-cycle cycle time is limited by the slowest instruction, whereas the multicycle cycle time is limited by the slowest functional unit.
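
The two cycle times and clock rates follow directly from the three latencies; a short check, using illustrative variable names:

```python
# Functional unit latencies in nanoseconds (from the example).
alu_ns, regfile_ns, memory_ns = 3, 2, 3

# Single-cycle: the clock must cover the whole lw path
# (instruction memory, register read, ALU, data memory, register write).
single_cycle_ns = memory_ns + regfile_ns + alu_ns + memory_ns + regfile_ns

# Multicycle: the clock must cover only the slowest functional unit.
multicycle_ns = max(alu_ns, regfile_ns, memory_ns)

def clock_rate_mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000 / period_ns

print(f"single-cycle: {single_cycle_ns} ns, "
      f"{clock_rate_mhz(single_cycle_ns):.0f} MHz")
print(f"multicycle:   {multicycle_ns} ns, "
      f"{clock_rate_mhz(multicycle_ns):.0f} MHz")
```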
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Comparing instruction execution times 


- In the single-cycle datapath, each instruction needs an entire clock cycle, or 13 ns, to execute.
- With the multicycle CPU, different instructions need different numbers of clock cycles, and hence different amounts of time.
  - A branch needs 3 cycles, or 3 x 3 ns = 9 ns.
  - Arithmetic and sw instructions each require 4 cycles, or 12 ns.
  - Finally, a lw takes 5 stages, or 15 ns.
- We can make some observations about performance already.
  - Loads take longer with this multicycle implementation, while all other instructions are faster than before.
  - So if our program doesn't have too many loads, then we should see an increase in performance.
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Performance Example 


- Assume the following operation times for components:
  - Instruction and data memories: 200 ps
  - ALU and adders: 180 ps
  - Decode and register file access (read or write): 150 ps
  - Ignore the delays in PC, mux, extender, and wires
- Which of the following would be faster and by how much?
  - Single-cycle implementation for all instructions
  - Multicycle implementation optimized for every class of instructions
- Assume the following instruction mix:
  - 40% ALU, 20% loads, 10% stores, 20% branches, & 10% jumps

Solution

Instruction class | Instruction memory | Register read | ALU operation | Data memory | Register write | Total
ALU               | 200                | 150           | 180           |             | 150            | 680 ps
Load              | 200                | 150           | 180           | 200         | 150            | 880 ps
Store             | 200                | 150           | 180           | 200         |                | 730 ps
Branch            | 200                | 150           | 180           |             |                | 530 ps
Jump              | 200                | 150 (decode and update PC) |  |             |                | 350 ps

- For fixed single-cycle implementation:
  - Clock cycle = 880 ps, determined by the longest delay (load instruction)
- For multi-cycle implementation:
  - Clock cycle = max (200, 150, 180) = 200 ps (maximum delay at any step)
  - Average CPI = 0.4x4 + 0.2x5 + 0.1x4 + 0.2x3 + 0.1x3 = 3.9
- Speedup = 880 ps / (3.9 x 200 ps) = 880 / 780 = 1.13
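
The same comparison can be scripted so that different mixes or delays are easy to try. The numbers are the ones given in the example; the dictionary names are illustrative.

```python
# Component delays in picoseconds.
MEMORY_PS, ALU_PS, REGFILE_PS = 200, 180, 150

# Single-cycle clock: the load path uses every component.
single_cycle_ps = MEMORY_PS + REGFILE_PS + ALU_PS + MEMORY_PS + REGFILE_PS

# Multicycle clock: the slowest single step.
multicycle_ps = max(MEMORY_PS, ALU_PS, REGFILE_PS)

# Cycles per instruction class and the instruction mix.
CYCLES = {"alu": 4, "load": 5, "store": 4, "branch": 3, "jump": 3}
MIX = {"alu": 0.40, "load": 0.20, "store": 0.10, "branch": 0.20, "jump": 0.10}

average_cpi = sum(CYCLES[c] * MIX[c] for c in CYCLES)
speedup = single_cycle_ps / (average_cpi * multicycle_ps)

print(f"CPI = {average_cpi:.1f}, speedup = {speedup:.2f}")
```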
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example 


Let's assume the following instruction mix:

Instruction class | Frequency
Arithmetic        | 48%
Loads             | 22%
Stores            | 11%
Branches          | 19%

In a single-cycle datapath, all instructions take 13 ns to execute.

The average execution time for an instruction on the multicycle processor works out to 12.09 ns:

(48% x 12 ns) + (22% x 15 ns) + (11% x 12 ns) + (19% x 9 ns) = 12.09 ns

The multicycle implementation is faster in this case, but not by much. The speedup here is only 13 ns / 12.09 ns = 1.075.
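
A quick numerical check of this average, using the mix percentages that appear in the weighted sum (48% arithmetic, 22% loads, 11% stores, 19% branches) and the 3 ns multicycle clock from the earlier cycle-time comparison. Variable names are illustrative.

```python
CLOCK_NS = 3            # multicycle clock period
SINGLE_CYCLE_NS = 13    # single-cycle clock period (every instruction)

# (cycles needed, share of the instruction mix) for each class.
CLASSES = {
    "arithmetic": (4, 0.48),
    "load":       (5, 0.22),
    "store":      (4, 0.11),
    "branch":     (3, 0.19),
}

average_ns = sum(cycles * CLOCK_NS * share
                 for cycles, share in CLASSES.values())
speedup = SINGLE_CYCLE_NS / average_ns

print(f"average = {average_ns:.2f} ns, speedup = {speedup:.3f}")
```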


Overview of a Multiple Cycle Implementation
- The root of the single cycle processor's problems:
  - The cycle time has to be long enough for the slowest instruction
- Solution:
  - Break the instruction into smaller steps.
  - Execute each step (instead of the entire instruction) in one cycle
    - Cycle time: time it takes to execute the longest step
    - Keep all the steps to have similar length
  - This is the essence of the multiple cycle processor
- The advantages of the multiple cycle processor:
  - Cycle time is much shorter
  - Different instructions take different numbers of cycles to complete
    - Load takes five cycles
    - Jump only takes three cycles
  - Allows a functional unit to be used more than once per instruction
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Pipelining Processor Design 


Pipelining is an implementation technique in which multiple instructions are overlapped in execution. The pipeline is divided into stages, and these stages are connected with one another to form a pipe-like structure.

General Principles of Pipelining
- Express task as a collection of stages
- Move instructions through stages
- Process several instructions at any given moment

Before there was pipelining... 


Single-cycle 





Multi-cycle 





* Single-cycle control: hardwired
  - Low CPI (1)
  - Long clock period (to accommodate slowest instruction)
* Multi-cycle control: micro-programmed
  - Short clock period
  - High CPI
* Can we have both low CPI and short clock period?





Pipelining 


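[Figure: timing of multi-cycle vs. pipelined execution; in the pipelined case the steps of successive instructions (e.g. insn2.fetch, insn2.dec, insn2.exec) overlap in time]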
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Pipelining Example 
* Laundry Example: Three Stages
  1. Wash dirty load of clothes
  2. Dry wet clothes
  3. Fold and put clothes into drawers
* Each stage takes 30 minutes to complete
* Four loads of clothes to wash, dry, and fold


Sequential Laundry

[Figure: four loads run back to back from 6 PM to 12 AM, 30 minutes per stage]

* Sequential laundry takes 6 hours for 4 loads
* Intuitively, we can use pipelining to speed up laundry

Pipelined Laundry: Start Load ASAP

[Figure: four overlapped loads from 6 PM to 9 PM, 30 minutes per stage]

* Pipelined laundry takes 3 hours for 4 loads
* Speedup factor is 2 for 4 loads
* Time to wash, dry, and fold one load is still the
  same (90 minutes)
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Principles of Pipelined Implementation 


» Break instructions across multiple clock cycles (five, in this 
case). Design a separate stage for the execution performed 
during each clock cycle. 


a. Instruction Fetch (IF) — get instruction from memory, increment PC 

b. Instruction Decode (ID) — translate opcode into control signals and read registers 
c. Execute (EX) — perform ALU operation, compute jump/branch targets 

d. Memory (MEM) — access memory if needed 

e. Writeback (WB) — update register file 


[Figure: program flow against time; successive instructions each pass through Ifetch, Dcd, Exec, Mem, WB]

* Add pipeline registers (flip-flops) to isolate signals between
  different stages.


In a pipelined system, each segment consists of an input register followed by a combinational
circuit. The register is used to hold data, and the combinational circuit performs operations on it.


The output of the combinational circuit is applied to the input register of the next segment.
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pipeline registers 
* We can break the execution into multiple cycles, but keep the
  extra hardware
* We need extra registers between stages to hold
  information produced in the previous cycle
* We may be able to start executing a new instruction at each clock
  cycle (pipelining)


Synchronous Pipeline 
* Uses clocked registers between stages
* Upon arrival of a clock edge ...
  - All registers hold the results of previous stages simultaneously
* The pipeline stages are combinational logic circuits
* It is desirable to have balanced stages
  - Approximately equal delay in all stages
* Clock period is determined by the maximum stage delay
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О Pipeline registers 





Pipeline registers must be wide enough to hold the data coming in.

[Figure: pipelined datapath (instruction memory, register file) with pipeline registers IF/ID, EX/MEM, and MEM/WB; widths shown include 97 bits and 64 bits]
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Stage 1: Fetch 
* Fetch an instruction from memory every cycle
  - Use PC to index memory
  - Increment PC (assume no branches for now)
* Write state to the pipeline register (IF/ID)
  - The next stage will read this pipeline register

[Figure: instruction cache feeding the IF/ID pipeline register]
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Stage 2: Decode 

* Decodes opcode bits
  - Set up control signals for later stages
* Read input operands from register file
  - Specified by decoded instruction bits
* Write state to the pipeline register (ID/EX)
  - Opcode
  - Register contents
  - PC+1
  - Control signals (from insn) for opcode and destReg

[Figure: decode stage between the IF/ID and ID/EX pipeline registers]
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Stage 3: Execute 
* Perform ALU operations
  - Calculate result of instruction
    * Control signals select operation
    * Contents of regA used as one input
    * Either regB or constant offset (from insn) used as second input
  - Calculate PC-relative branch target
    * PC+1+(constant offset)
* Write state to the pipeline register (EX/Mem)
  - ALU result, contents of regB, and PC+1+offset
  - Control signals (from insn) for opcode and destReg

[Figure: execute stage between the ID/EX and EX/Mem pipeline registers, producing the branch target]
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Stage 4: Memory 
* Perform data cache access
  - ALU result contains address for LD or ST
  - Opcode bits control R/W and enable signals
* Write state to the pipeline register (Mem/WB)
  - ALU result and loaded data
  - Control signals (from insn) for opcode and destReg

[Figure: data cache between two pipeline registers]
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Stage 5: Write-back
* Writing result to register file (if required)
  - Write loaded data to destReg for LD
  - Write ALU result to destReg for arithmetic insn
  - Opcode bits control register write enable signal

[Figure: write-back stage reading the Mem/WB pipeline register]


Putting It All Together
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A pipeline diagram

                          Clock cycle
                      1    2    3    4    5    6    7    8    9
lw   $t0, 4($sp)      IF   ID   EX   MEM  WB
sub  $v0, $a0, $a1         IF   ID   EX   MEM  WB
and  $t1, $t2, $t3              IF   ID   EX   MEM  WB
or   $s0, $s1, $s2                   IF   ID   EX   MEM  WB
add  $sp, $sp, -4                         IF   ID   EX   MEM  WB

* A pipeline diagram shows the execution of a series of instructions.
  - The instruction sequence is shown vertically, from top to bottom.
  - Clock cycles are shown horizontally, from left to right.
  - Each instruction is divided into its component stages. (We show five
    stages for every instruction, which will make the control unit easier.)
* This clearly indicates the overlapping of instructions. For example, there
  are three instructions active in the third cycle above.
  - The "lw" instruction is in its Execute stage.
  - Simultaneously, the "sub" is in its Instruction Decode stage.
  - Also, the "and" instruction is just being fetched.
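
A pipeline diagram like this one can be generated mechanically. The sketch below assumes an ideal five-stage pipeline with no stalls, where instruction i (counting from 0) enters IF in cycle i + 1; the function names are illustrative only.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_at(instr_index, cycle):
    """Stage of the instruction at 0-based position `instr_index` during
    1-based clock `cycle`, or None if it is not in the pipeline then."""
    pos = cycle - 1 - instr_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


def active(program, cycle):
    """Instructions in flight during `cycle`, mapped to their stages."""
    result = {}
    for i, name in enumerate(program):
        stage = stage_at(i, cycle)
        if stage is not None:
            result[name] = stage
    return result


program = ["lw", "sub", "and", "or", "add"]
total_cycles = len(STAGES) + len(program) - 1

# One row per instruction, one column per clock cycle.
for i, name in enumerate(program):
    cells = [(stage_at(i, c) or "").ljust(3) for c in range(1, total_cycles + 1)]
    print(name.ljust(3), " ".join(cells))

print(active(program, 3))
```

Asking for the active instructions in cycle 3 reproduces the three overlapping instructions described in the bullets.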


Pipeline terminology 





[Figure: the same five-instruction pipeline diagram (lw, sub, and, or, add) against clock cycles, with the phases marked filling, full, and emptying]

* The pipeline depth is the number of stages, in this case five.
* In the first four cycles here, the pipeline is filling, since there are unused
  functional units.
* In cycle 5, the pipeline is full. Five instructions are being executed
  simultaneously, so all hardware units are in use.
* In cycles 6-9, the pipeline is emptying.
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Single Cycle, Multiple Cycle, vs. Pipeline

[Figure: clock timelines for the single cycle implementation (two long cycles, Cycle 1 and Cycle 2), the multiple cycle implementation (Cycles 1 to 10 of a shorter clock), and the pipeline implementation (overlapped instructions, including a store)]

If there are k stages, and each stage takes t
time units, then the time needed to execute N
instructions is

k.t + (N-1).t
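
As a quick numeric check of this formula, here is a minimal sketch in Python. The function names are illustrative, and the example values (5 stages of 200 ps, and the 3-stage, 30-minute laundry case) are taken from elsewhere in these notes.

```python
def pipeline_time(k, t, n):
    """Time for n instructions on a k-stage pipeline with stage time t,
    assuming no stalls: k*t for the first, then one more t per instruction."""
    return k * t + (n - 1) * t


def serial_time(k, t, n):
    """Time when each instruction finishes all k stages before the next starts."""
    return n * k * t


for n in (1, 5, 1000):
    print(n, pipeline_time(5, 200, n), serial_time(5, 200, n))

# Laundry example: 3 stages of 30 minutes, 4 loads.
print(pipeline_time(3, 30, 4), serial_time(3, 30, 4))
```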
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Pipeline Performance 


Assume time for stages is
- 100ps for register read or write
- 200ps for other stages
Compare pipelined datapath with single-cycle
datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps




[Figure: program execution order of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) against time. Single-cycle datapath: a new instruction starts every 800 ps. Pipelined datapath: a new instruction starts every 200 ps, on a time axis marked 200 to 1400.]
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Single-Cycle vs Pipelined Performance

* Consider a 5-stage instruction execution in which ...
  - Instruction fetch = ALU operation = Data memory access = 200 ps
  - Register read = register write = 150 ps
* What is the single-cycle non-pipelined time?
* What is the pipelined cycle time?
* What is the speedup factor for pipelined execution?
* Solution
  - Non-pipelined cycle = 200+150+200+200+150 = 900 ps
  - Pipelined cycle time = max(200, 150) = 200 ps
  - CPI for pipelined execution = 1
    + One instruction completes each cycle (ignoring pipeline fill)
  - Speedup of pipelined execution = 900 ps / 200 ps = 4.5
    + Instruction count and CPI are equal in both cases
  - Speedup factor is less than 5 (number of pipeline stages)
    + Because the pipeline stages are not balanced
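
The solution above can be reproduced from the stage delays directly. In this sketch the stage labels are illustrative, and the "balanced" case is a hypothetical comparison in which the same 900 ps is split evenly across five stages.

```python
# Stage delays in ps, as given in the problem statement.
stage_ps = {"IF": 200, "Reg read": 150, "ALU": 200, "MEM": 200, "Reg write": 150}

single_cycle_ps = sum(stage_ps.values())      # one long cycle covers every stage
pipelined_cycle_ps = max(stage_ps.values())   # clock is set by the slowest stage
speedup = single_cycle_ps / pipelined_cycle_ps

# Hypothetical balanced pipeline: same total work, equal stage delays.
balanced_cycle_ps = single_cycle_ps / len(stage_ps)
ideal_speedup = single_cycle_ps / balanced_cycle_ps

print(single_cycle_ps, pipelined_cycle_ps, speedup, ideal_speedup)
```

The gap between 4.5 and 5 comes entirely from the two 150 ps stages idling for 50 ps of every 200 ps cycle.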
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Instruction- Time Diagram 
* Diagram shows:
  - Which instruction occupies what stage at each clock cycle
* Instruction execution is pipelined over the 5 stages
* Up to five instructions can be in execution during a single cycle
* ALU instructions skip the MEM stage
* Store instructions skip the WB stage

[Figure: instruction order (top to bottom) lw $7, 8($3); lw $6, 8($5); ori $4, $3, 7; sub $5, $2, $3; sw $2, 10($3), plotted against clock cycles CC1 to CC9]


Serial Execution versus Pipelining 


* Consider a task that can be divided into k subtasks
  - The k subtasks are executed on k different stages
  - Each subtask requires one time unit
  - The total execution time of the task is k time units
* Pipelining is to overlap the execution
  - The k stages work in parallel on k different tasks
  - Tasks enter/leave the pipeline at the rate of one task per time unit

Without Pipelining: one completion every k time units
With Pipelining: one completion every 1 time unit
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Pipeline Speedup 


* If all stages are balanced (take the same time)
  - Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
  - Potential speedup = number of pipe stages
* If not balanced, speedup is less
* Speedup is due to increased throughput
* Pipelining does not reduce latency (time for each instruction) of a
  single task
  - It increases throughput of the entire workload


Pipeline Performance 
* Let $\tau_i$ = time delay in stage $S_i$
* Clock cycle $\tau = \max(\tau_i)$ is the maximum stage delay
* Clock frequency $f = 1/\tau = 1/\max(\tau_i)$
* A pipeline can process n tasks in k + n - 1 cycles
  - k cycles are needed to complete the first task
  - n - 1 cycles are needed to complete the remaining n - 1 tasks
* Ideal speedup of a k-stage pipeline over serial execution

$$S_k = \frac{\text{serial execution in cycles}}{\text{pipelined execution in cycles}} = \frac{nk}{k + n - 1}, \qquad S_k \to k \text{ for large } n$$
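
To see how quickly the ideal speedup approaches k, the ratio can be tabulated for growing n. This is a small sketch; the depth k = 5 is an assumed example value, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def ideal_speedup(k, n):
    """S_k = n*k / (k + n - 1): serial cycles over pipelined cycles."""
    return Fraction(n * k, k + n - 1)


k = 5  # assumed pipeline depth for illustration
for n in (1, 10, 100, 10_000):
    print(n, float(ideal_speedup(k, n)))
```

With a single task there is no gain at all (speedup 1), which is the fill and drain cost mentioned in the summary; the laundry case (k = 3, n = 4) gives exactly 2.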
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Pipeline Performance Summary 


* Pipelining doesn't improve latency of a single instruction
* However, it improves throughput of entire workload
  - Instructions are initiated and completed at a higher rate
* In a k-stage pipeline, k instructions operate in parallel
  - Overlapped execution using multiple hardware resources
  - Potential speedup = number of pipeline stages k
* Pipeline rate is limited by slowest pipeline stage
  - Unbalanced lengths of pipeline stages reduce speedup
  - Also, time to fill and drain pipeline reduces speedup
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Design Instruction Sets for Pipelining 


First, all MIPS instructions are the same length (32 bits). This restriction makes it much
easier to fetch instructions in the first pipeline stage and to decode them in the second stage.


Second, MIPS has only a few instruction formats, with the source register fields being located
in the same place in each instruction. This symmetry means that the second stage can begin
reading the register file at the same time that the hardware is determining what type of
instruction was fetched.


Third, memory operands only appear in loads or stores in MIPS. This restriction means
we can use the execute stage (3rd stage) to calculate the memory address and then access
memory in the following stage (4th stage).


Fourth, operands must be aligned in memory, so a data transfer needs only a single memory access and fits in one pipeline stage.


Instruction set architectures and pipelining 
* The MIPS instruction set was designed especially for easy pipelining.
  - All instructions are 32 bits long, so the instruction fetch stage just
    needs to read one word on every clock cycle.
  - Fields are in the same position in different instruction formats: the
    opcode is always the first six bits, rs is the next five bits, etc. This
    makes things easy for the ID stage.
  - MIPS is a register-to-register architecture, so arithmetic operations
    cannot contain memory references. This keeps the pipeline shorter
    and simpler.
* Pipelining is harder for older, more complex instruction sets.
  - If different instructions had different lengths or formats, the fetch
    and decode stages would need extra time to determine the actual
    length of each instruction and the position of the fields.
  - With memory-to-memory instructions, additional pipeline stages may
    be needed to compute effective addresses and read memory before
    the EX stage.
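
The fixed field positions are what let the ID stage read registers before it knows the instruction type. As a rough illustration, the sketch below extracts the fields from two 32-bit words using the standard MIPS bit positions; the function name and the two example encodings are chosen for this illustration and are not part of the slides.

```python
def decode_fields(word):
    """Split a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31:26, same in every format
        "rs":     (word >> 21) & 0x1F,   # bits 25:21
        "rt":     (word >> 16) & 0x1F,   # bits 20:16
        "rd":     (word >> 11) & 0x1F,   # bits 15:11 (meaningful for R-format)
        "funct":  word & 0x3F,           # bits 5:0   (meaningful for R-format)
        "imm16":  word & 0xFFFF,         # bits 15:0  (meaningful for I-format)
    }


add_word = 0x012A4020  # add $t0, $t1, $t2  (R-format)
lw_word = 0x8FA80004   # lw  $t0, 4($sp)    (I-format)

for word in (add_word, lw_word):
    print(hex(word), decode_fields(word))
```

Both words yield rs and rt from the same bit positions, so the register file can be read in parallel with decoding the opcode.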
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Pipelined Datapath and Control 
Single-Cycle Datapath

* Shown below is the single-cycle datapath
* How to pipeline this single-cycle datapath?
  Answer: Introduce registers at the end of each stage

IF = Instruction Fetch | ID = Decode and Register Fetch | EX = Execute and Calculate Address | MEM = Memory Access | WB = Write Back

[Figure: single-cycle datapath with instruction memory (instruction address in, instruction out), ALU result, and data memory (address, data in)]

Pipelined Datapath 
* Pipeline registers, in green, separate each pipeline stage and hold information
  produced in the previous cycle
* The registers must be wide enough to store all the data corresponding to the
  lines that go through them
* Pipeline registers are labeled by the stages they separate
* Is there a problem with the register destination address?

IF = Instruction Fetch | ID = Decode | EX = Execute | MEM = Memory | WB

[Figure: pipelined datapath with instruction memory and pipeline registers between the stages]
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О Right-to-left flow leads to hazards 













[Figure: pipelined datapath with instruction memory and the IF/ID and EX/MEM pipeline registers; a 64-bit register width is marked]

Only data flowing right to left may cause a hazard
* Pipeline registers propagate data and control values to later stages.
* Each step of the instruction can be mapped onto the datapath from left
  to right.
* There are two exceptions to this left-to-right flow of instructions:
  - The update of the PC (choosing between the incremented
    PC and the branch address).
  - The write-back step, which sends either the ALU result or the data from memory
    to the left to be written into the register file.
* Data flowing from right to left does not affect the current instruction;
  these reverse data movements influence only later instructions in the
  pipeline.
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Corrected Pipelined Datapath 


* Destination register number should come from MEM/WB
  - Along with the data during the write-back stage
* Destination register number is passed from ID to WB stage

[Figure: corrected pipelined datapath, IF | ID | EX | MEM | WB, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; a 69-bit register width is marked]

Destination register number is also passed through ID/EX, EX/MEM
and MEM/WB registers, which are now wider by 5 bits
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Pipeline Operation 


Cycle-by-cycle flow of instructions through the pipelined datapath for 
load & store 


IF for Load, Store, ...
[Figure: instruction fetch stage highlighted on the pipelined datapath]

ID for Load, Store, ...
[Figure: instruction decode stage highlighted on the pipelined datapath]
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EX for Load
[Figure: execution stage highlighted for lw]

MEM for Load
[Figure: memory stage highlighted for lw]
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WB for Load
[Figure: write-back stage highlighted for lw]
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EX for Store
[Figure: execution stage highlighted for sw]

MEM for Store
[Figure: memory stage highlighted for sw]
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WB for Store
[Figure: write-back stage highlighted for sw]
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О Example: 

О Consider the following instruction sequence: 
lw   $t0, 10($t1)
sw   $t3, 20($t4)
add  $t5, $t6, $t7
sub  $t8, $t9, $t10

Clock Cycle 1:
[Figure: pipelined datapath state in clock cycle 1]
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О Clock Cycle 2: 





Dr. Ahmed Jaber 





Spring 2019 




О Clock Cycle 4: 











[Figure: pipelined datapath state in clock cycle 4]
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О Clock Cycle 6: 
[Figure: pipelined datapath state in clock cycle 6, with SUB, ADD, and SW in flight]

Clock Cycle 7:
[Figure: pipelined datapath state in clock cycle 7, with SUB and ADD in flight]
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О Clock Cycle 8: 


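[Figure: pipelined datapath state in clock cycle 8, with only SUB in flight]

                      CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
lw   $t0, 10($t1)     IF   ID   EX   MEM  WB
sw   $t3, 20($t4)          IF   ID   EX   MEM  WB
add  $t5, $t6, $t7              IF   ID   EX   MEM  WB
sub  $t8, $t9, $t10                  IF   ID   EX   MEM  WB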
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Pipelined Control (Simplified) 


[Figure: pipelined datapath with control lines, including PCSrc and MemtoReg, the branch adder (Add result), and instruction bits (20-16)]

* As was the case for the single-cycle implementation, we assume that the PC is written on each
  clock cycle, so there is no separate write signal for the PC. By the same argument, there are no
  separate write signals for the pipeline registers (IF/ID, ID/EX, EX/MEM, and MEM/WB), since
  the pipeline registers are also written during each clock cycle.

* To specify control for the pipeline, we need only set the control values during each pipeline stage.
  Because each control line is associated with a component active in only a single pipeline stage,
  we can divide the control lines into five groups according to the pipeline stage.


1. Instruction fetch: The control signals to read instruction memory and to write the PC are always
asserted, so there is nothing special to control in this pipeline stage.


2. Instruction decode/register file read: As in the previous stage, the same thing happens at every
clock cycle, so there are no optional control lines to set.
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3. Execution/address calculation: The signals to be set are RegDst, ALUOp, and ALUSrc. The
signals select the Result register, the ALU operation, and either Read data 2 or a sign-extended
immediate for the ALU.

4. Memory access: The control lines set in this stage are Branch (PCSrc), MemRead, and
MemWrite. The branch equal, load, and store instructions set these signals, respectively. Recall that
PCSrc selects the next sequential address unless control asserts Branch and the ALU result was 0.


5. Write-back: The two control lines are MemtoReg, which decides between sending the ALU result
or the memory value to the register file, and RegWrite, which writes the chosen value.


Instruction    Instruction         Function   Desired ALU        ALU control
opcode         operation           code       action             input
LW             load word           XXXXXX     add                0010
SW             store word          XXXXXX     add                0010
Branch equal   branch equal        XXXXXX     subtract           0110
R-type         AND                 100100     and                0000
R-type         OR                  100101     or                 0001
R-type         set on less than    101010     set on less than   0111


Effect of each control signal when deasserted (0) and when asserted (1):

RegDst
  Deasserted (0): The register destination number for the Write register comes from the rt field (bits 20:16).
  Asserted (1): The register destination number for the Write register comes from the rd field (bits 15:11).

RegWrite
  Deasserted (0): None.
  Asserted (1): The register on the Write register input is written with the value on the Write data input.

ALUSrc
  Deasserted (0): The second ALU operand comes from the second register file output (Read data 2).
  Asserted (1): The second ALU operand is the sign-extended, lower 16 bits of the instruction.

PCSrc
  Deasserted (0): The PC is replaced by the output of the adder that computes the value of PC + 4.
  Asserted (1): The PC is replaced by the output of the adder that computes the branch target.

MemRead
  Deasserted (0): None.
  Asserted (1): Data memory contents designated by the address input are put on the Read data output.

MemWrite
  Deasserted (0): None.
  Asserted (1): Data memory contents designated by the address input are replaced by the value on the Write data input.

MemtoReg
  Deasserted (0): The value fed to the register Write data input comes from the ALU.
  Asserted (1): The value fed to the register Write data input comes from the data memory.


Control lines per instruction, grouped by pipeline stage:
  Execution/address calculation stage control lines: RegDst, ALUOp1, ALUOp0, ALUSrc
  Memory access stage control lines: Branch, MemRead, MemWrite
  Write-back stage control lines: RegWrite, MemtoReg
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What about control signals? 





* The control signals are generated in the same way as in the single-cycle
  processor: after an instruction is fetched, the processor decodes it and
  produces the appropriate control values.
* But just like before, some of the control signals will not be needed until
  some later stage and clock cycle.
* These signals must be propagated through the pipeline until they reach
  the appropriate stage. We can just pass them in the pipeline registers,
  along with the other data.
* Control signals can be categorized by the pipeline stage that uses them.


Stage   Control signals needed
EX      ALUSrc, ALUOp, RegDst
MEM     MemRead, MemWrite, PCSrc
WB      RegWrite, MemToReg
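
One consequence of this grouping is that each pipeline register only has to carry the control signals still waiting to be used. The sketch below works that out from the table; the dictionary and function names are illustrative, and it counts signal names only (ALUOp is two bits wide in hardware).

```python
# Stage -> control signals it consumes (from the table above).
STAGE_SIGNALS = {
    "EX":  ["ALUSrc", "ALUOp", "RegDst"],
    "MEM": ["MemRead", "MemWrite", "PCSrc"],
    "WB":  ["RegWrite", "MemToReg"],
}
# Pipeline register that feeds each stage.
REGISTER_BEFORE = {"EX": "ID/EX", "MEM": "EX/MEM", "WB": "MEM/WB"}
ORDER = ["EX", "MEM", "WB"]


def carried_signals():
    """Signals each pipeline register must hold: those needed by the stage
    it feeds plus those needed by every later stage."""
    carried = {}
    for i, stage in enumerate(ORDER):
        needed = []
        for later in ORDER[i:]:
            needed.extend(STAGE_SIGNALS[later])
        carried[REGISTER_BEFORE[stage]] = needed
    return carried


for reg, signals in carried_signals().items():
    print(reg, len(signals), signals)
```

The control payload shrinks from stage to stage: ID/EX holds all three groups, while MEM/WB holds only the write-back signals.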


[Figure: control unit and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
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IF = Instruction Fetch | ID = Instruction Decode | EX = Execute | МЕМ = Memory Access 


[Figure: pipelined datapath with control. Labels include Branch Target Address, Jump Target = PC[31:28] || Imm26, Next PC address, Zero and ALU Result, Main & ALU Control driven by Op and func, and the BEQ and BNE signals. Pipeline control signals just like data.]

* ID stage generates all the control signals
* Pipeline the control signals as the instruction moves
  - Extend the pipeline registers to include the control signals
* Each stage uses some of the control signals
  - Instruction Decode and Register Read
    + Control signals are generated
    + RegDst and ExtOp are used in this stage, J (Jump) is used by PC control
  - Execution Stage => ALUSrc, ALUOp, BEQ, BNE
    + ALU generates zero signal for PC control logic (Branch Control)
  - Memory Stage => MemRd and MemWr
  - Write Back Stage => RegWr and MemtoReg
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Control Signals Summary 


[Table: control signals per instruction (Op), grouped by stage beginning with the Decode stage; entries include 0 = Rt for ADDI and SLTI, and 1 = sign, 0 = zero]
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Pipeline Hazards 
There are situations in pipelining when the next instruction cannot
execute in the following clock cycle. These events are called hazards, and
there are three different types.


1. Structural hazards (a required resource is busy)
   - Caused by resource contention
   - Two instructions use the same resource during the same cycle
2. Data hazards (need to wait for a previous instruction to complete its data read/write)
   - An instruction may compute a result needed by the next instruction
   - Hardware can detect dependencies between instructions
3. Control hazards (deciding on a control action depends on a previous instruction)
   - Caused by instructions that change control flow (branches/jumps)
   - Delays in changing the flow of control


* Hazards complicate pipeline control and limit performance


How do we deal with hazards?


* A common solution is to stall the pipeline until the
  hazard is resolved, inserting one or more
  "bubbles" in the pipeline
* In the design of pipelined computer processors,
  a pipeline stall is a delay in execution of an instruction in
  order to resolve a hazard. Such an event is often called
  a bubble, by analogy with an air bubble in a fluid pipe.
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Stalls and performance 
Stalls impede progress of a pipeline and result in
deviation from 1 instruction executing per clock cycle


Pipelining can be viewed as a way to:
- Decrease CPI or clock cycle time per instruction
- Let's see what effect stalls have on CPI...


CPI pipelined
  = Ideal CPI + Pipeline stall cycles per instruction
  = 1 + Pipeline stall cycles per instruction


Stalls and performance
Ignoring overhead and assuming stages are balanced:

$$\text{Speedup} = \frac{CPI_{\text{unpipelined}} \times T_{c,\text{unpipelined}}}{(1 + \text{pipeline stall cycles per instruction}) \times T_{c,\text{pipelined}}}$$

If there are no stalls, the speedup equals the number of pipeline stages in the
ideal case
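
A short numeric sketch of the stall formula follows. The numbers are hypothetical: a 5-stage pipeline compared against an unpipelined multicycle machine with the same 200 ps clock and a CPI of 5, first with no stalls and then with one stall cycle per four instructions.

```python
def pipelined_cpi(stall_cycles_per_instr, ideal_cpi=1.0):
    """CPI pipelined = ideal CPI + pipeline stall cycles per instruction."""
    return ideal_cpi + stall_cycles_per_instr


def speedup(cpi_unpipelined, tc_unpipelined, stall_cycles_per_instr, tc_pipelined):
    """Time per instruction unpipelined over time per instruction pipelined."""
    pipelined = pipelined_cpi(stall_cycles_per_instr) * tc_pipelined
    return (cpi_unpipelined * tc_unpipelined) / pipelined


depth, tc_ps = 5, 200  # assumed example values
print(speedup(depth, tc_ps, 0.0, tc_ps))    # no stalls
print(speedup(depth, tc_ps, 0.25, tc_ps))   # one stall every four instructions
```

Even a modest stall rate of 0.25 cycles per instruction costs a fifth of the ideal speedup.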







* Problem
  - Attempt to use the same hardware resource by two different
    instructions during the same clock cycle
* Example structural hazard
  - Writing back ALU result in stage 4
  - Conflict with writing load data in stage 5
  - Two instructions are attempting to write the register file during the same cycle

[Figure: instructions lw $t6, 8($s5); ori $t4, $s3, 7; sub $t5, $s2, $s3; sw $s2, 10($s3) plotted against clock cycles CC1 to CC9]


Resolving Structural Hazards 


* Serious hazard:
  - Hazard cannot be ignored
* Solution 1: Delay access to resource
  - Must have mechanism to delay instruction access to resource
  - Delay all write-backs to the register file to stage 5
    + ALU instructions bypass stage 4 (memory) without doing anything
* Solution 2: Add more hardware resources (more costly)
  - Add more hardware to eliminate the structural hazard
  - Redesign the register file to have two write ports
    + First write port can be used to write back ALU results in stage 4
    + Second write port can be used to write back load data in stage 5
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One Memory Port/Structural Hazards 


[Figure: pipeline diagram against time for Load followed by Instructions 1 to 4, each passing through Mem, Reg, ALU, DM, Reg. The overlap of a data memory access with an instruction fetch would be a structural hazard if there was only one cache for both.]
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How is it resolved? 
[Figure: pipeline diagram against clock numbers 1 to 9 with a stall inserted]

Or alternatively...

[Figure: the same sequence shown as a table against clock number]

* The LOAD instruction "steals" an instruction fetch cycle, which
  causes the pipeline to stall.
* Thus, no instruction completes on clock cycle 8
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How About Register File Access? 


[Figure: pipeline diagram against time (clock cycles) in which an add writing $1 overlaps a later add that reads $1. Fix register file access: one clock edge controls register writing, and the other clock edge controls loading of the pipeline state registers.]

Data Hazards 


Data hazard: Also called a pipeline data hazard. Occurs when a planned instruction
cannot execute in the proper clock cycle because data that is needed to
execute the instruction is not yet available.
* Dependency between instructions causes a data hazard
* The dependent instructions are close to each other
  - Pipelined execution might change the order of operand access
* Read After Write (RAW) hazard
  - Given two instructions x and y, where x comes before y ...
  - Instruction y should read an operand after it is written by x
  - Called a data dependence in compiler terminology

      x: add $1, $2, $3    # r1 is written (fifth stage)
      y: sub $4, $1, $3    # r1 is read (second stage)

  - Hazard occurs when y reads the operand before x writes it
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Example of a RAW Data Hazard

[Figure: pipeline diagram against clock cycles for the sequence below; the value of $2 is 10 through CC4, changes from 10 to 20 during CC5, and is 20 afterwards]

sub $2, $1, $3
and $4, $2, $5
or  $6, $3, $2
add $7, $2, $2
sw  $8, 10($2)

* Result of sub is needed by and, or, add, & sw instructions
* Instructions and & or will read old value of $2 from reg file
* During CC5, $2 is written and read: new value is read

The SUB does not write to register $2 until clock cycle 5,
causing 2 data hazards in our pipelined datapath
* The AND reads register $2 in cycle 3. Since SUB hasn't
  modified the register yet, this is the old value of $2
* Similarly, the OR instruction uses register $2 in cycle 4, again
  before it's actually updated by SUB

The ADD is okay, because of the register file design
* Registers are written at the beginning of a clock cycle
* The new value will be available by the end of that cycle
The SW is no problem at all, since it reads $2 after the
SUB finishes
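
The reasoning above can be turned into a small hazard checker. This is a sketch under stated assumptions: a five-stage pipeline without forwarding, where the register file is written early in the cycle and read later in the same cycle, so only the two instructions immediately before a reader can cause a RAW hazard (`window=2`). The `Instr` type and function name are invented for this illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]      # register written, or None (e.g. sw)
    srcs: Tuple[str, ...]    # registers read


def raw_hazards(program, window=2):
    """List (producer, consumer, register) for every source register whose
    nearest earlier writer is within `window` instructions of the reader."""
    hazards = []
    for j, consumer in enumerate(program):
        for reg in sorted(set(consumer.srcs)):
            if reg == "$0":
                continue  # $0 always reads as zero
            # Search backwards so the nearest writer is the one reported.
            for i in range(j - 1, max(j - window, 0) - 1, -1):
                if program[i].dest == reg:
                    hazards.append((program[i].text, consumer.text, reg))
                    break
    return hazards


program = [
    Instr("sub", "$2", ("$1", "$3")),
    Instr("and", "$4", ("$2", "$5")),
    Instr("or",  "$6", ("$3", "$2")),
    Instr("add", "$7", ("$2", "$2")),
    Instr("sw",  None, ("$8", "$2")),
]
print(raw_hazards(program))
```

For this sequence the checker reports the sub-and and sub-or pairs and nothing for add or sw, matching the discussion above.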
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Solution 1: Stalling the Pipeline 


[Figure: pipeline diagram for sub $2, $1, $3; and $4, $2, $5; or $6, $3, $2 against clock cycles, with two bubbles after sub; the value of $2 changes from 10 to 20 during CC5]

* The and instruction cannot fetch $2 until CC5
  - The and instruction remains in the IF/ID register until CC5
* Two bubbles are inserted into ID/EX at end of CC3 & CC4
  - Bubbles are NOP instructions: do not modify registers or memory
  - Bubbles delay instruction execution and waste clock cycles
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Solution 2. Forwarding 


Forwarding: Also called bypassing. A method of resolving a data hazard
by retrieving the data element from internal buffers rather than waiting
for it to arrive from programmer-visible registers or memory.


* Generally speaking:
  - Forwarding occurs when a result is passed directly to the functional unit
    that requires it.
  - Result goes from output of one unit to input of another


When can we forward?

ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR  R8, R1, R9
XOR R10, R1, R11

[Figure: pipeline diagram of the five instructions against time, with forwarding lines drawn from ADD to the later instructions]

Rule of thumb:
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Г] 
* SUB gets info. from EX/MEM pipe register
* AND gets info. from MEM/WB pipe register
* OR gets info. by forwarding from register file

If line goes "forward" you can do forwarding.
If it's drawn backward, it's physically impossible.
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Detecting the Need to Forward

* Pass register numbers along pipeline
  - e.g., ID/EX.RegisterRs = register number for Rs
    sitting in ID/EX pipeline register
* ALU operand register numbers in EX stage are
  given by
  - ID/EX.RegisterRs, ID/EX.RegisterRt
* Data hazards when

1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
Solution 2: Forwarding ALU Result
* The ALU result is forwarded (fed back) to the ALU input
  - No bubbles are inserted into the pipeline and no cycles are wasted
* ALU result exists in either EX/MEM or MEM/WB register

[Figure: pipeline diagram against clock cycles for sub $2, $1, $3; and $4, $2, $5; or $6, $3, $2; add $7, $2, $2; sw $8, 10($2) in program execution order, with the ALU result of sub forwarded to later instructions]
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Dependence Detection 
The sub-and is a first hazard:
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2


The sub-or is a second hazard:
MEM/WB.RegisterRd = ID/EX.RegisterRt = $2


The two dependences on sub-add are not hazards
because the register file supplies the proper data
during the ID stage of add.


There is no data hazard between sub and sw because
sw reads $2 the clock cycle after sub writes $2.
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Detecting the Need to Forward 


* Pass register numbers along the pipeline
  - e.g., ID/EX.RegisterRs = register number for Rs sitting in the
    ID/EX pipeline register
* ALU operand register numbers in the EX stage are given by
  - ID/EX.RegisterRs, ID/EX.RegisterRt
* Data hazards when

    1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
    1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
    2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
    2b. MEM/WB.RegisterRd = ID/EX.RegisterRt

However, not all instructions write a register, so these comparisons
alone would sometimes signal hazards that do not exist.

* Forward only if the forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite

Also, results are never written to register $0, so if an instruction
uses $0 as its destination (which is legal), its result should not be
forwarded.

* And only if Rd for that instruction is not $zero
  - EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0
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Forwarding Conditions 


* EX hazard

    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10

* MEM hazard

    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
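
The conditions above can be sketched as a small software model. This is only an illustration, not the hardware itself: the function name `forwarding_unit` and the dictionary layout are hypothetical, and checking the EX hazard before the MEM hazard (so that the most recent result wins when both match) is an assumption that the conditions as listed do not state explicitly.

```python
def forwarding_unit(ex_mem, mem_wb, id_ex_rs, id_ex_rt):
    """Return (ForwardA, ForwardB) as 2-bit strings.

    ex_mem and mem_wb are dicts with keys "RegWrite" (bool) and
    "RegisterRd" (int). This layout is hypothetical, chosen for the sketch.
    """
    def select(source_reg):
        # EX hazard: the instruction one ahead will write source_reg.
        if (ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0
                and ex_mem["RegisterRd"] == source_reg):
            return "10"
        # MEM hazard: the instruction two ahead will write source_reg.
        # Assumption: it is only considered when no EX hazard matched.
        if (mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0
                and mem_wb["RegisterRd"] == source_reg):
            return "01"
        # No hazard: the operand comes from the register file.
        return "00"

    return select(id_ex_rs), select(id_ex_rt)


# and $4, $2, $5 is in EX while sub $2, $1, $3 sits in EX/MEM.
sub_in_ex_mem = {"RegWrite": True, "RegisterRd": 2}
nothing_in_mem_wb = {"RegWrite": False, "RegisterRd": 0}
print(forwarding_unit(sub_in_ex_mem, nothing_in_mem_wb, 2, 5))
```

For the `and` instruction only the first operand ($2) matches the destination of `sub`, so only ForwardA selects the EX/MEM value.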


Implementing Forwarding 


[Figure: datapath between the ID/EX, EX/MEM, and MEM/WB pipeline registers, showing the data memory and the forwarding unit, whose inputs include EX/MEM.RegisterRd and MEM/WB.RegisterRd.]
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Pipelined Architecture with Forwarding 


Forwarding multiplexer select values:
    00: Register file to ALU
    10: ALU to ALU
    01: MEM data or ALU to ALU

[Figure: pipelined datapath with forwarding, showing the IF/ID, ID/EX, EX/MEM, and MEM/WB registers, the instruction memory, the register file, the data memory, and the IF/ID.RegisterRs, IF/ID.RegisterRt, and IF/ID.RegisterRd fields passed down the pipeline.]

Mux control   | Source | Explanation
ForwardA = 00 | ID/EX  | The first ALU operand comes from the register file.
ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result (previous ALU result).
ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result (second-previous ALU result).
ForwardB = 00 | ID/EX  | The second ALU operand comes from the register file.
ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result.
ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result.
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Forwarding Example 


Instruction sequence:

    lw  $4, 100($9)
    add $7, $5, $6
    sub $8, $4, $7

When lw reaches the MEM stage:
* add will be in the ALU stage
* sub will be in the Decode stage

For sub $8, $4, $7:
* ForwardA = 10: forward data from the MEM stage
* ForwardB = 01: forward the ALU result from the ALU stage

[Figure: datapath snapshot with sub $8, $4, $7 in Decode, add $7, $5, $6 in the ALU stage, and lw $4, 100($9) in the MEM stage.]
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Forwarding doesn't always work 


    LW  R1, 0(R2)
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9

[Figure: pipeline timing diagram of this sequence against time.]

A load has a latency that forwarding cannot hide. The pipeline must
stall until the hazard is cleared (starting with the instruction that
wants to use the data, until the source produces it).

The data cannot reach the subtract because the result is needed at the
beginning of CC #4 but is not produced until the end of CC #4.

The solution pictorially

    LW  R1, 0(R2)
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9

[Figure: the same sequence with a bubble inserted after the load.]

Inserting the bubble increases the number of cycles needed to complete
this sequence by 1.
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Load Delay

* Unfortunately, not all data hazards can be resolved by forwarding
* A load has a delay that cannot be eliminated by forwarding
* In the example shown below:
  - The LW instruction does not read the data until the end of CC4
  - The data cannot be forwarded to AND at the end of CC3: NOT possible

Program execution order (in instructions):

    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    slt $1, $6, $7

[Figure: pipeline timing diagram of this sequence against time (in clock cycles).]






* Detecting a RAW hazard after a load instruction:
  - The load instruction will be in the EX stage
  - The instruction that depends on the load data is in the decode stage
* Condition for stalling the pipeline:

    if (ID/EX.MemRead
        and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
             (ID/EX.RegisterRt = IF/ID.RegisterRt)))
      stall the pipeline

* The first line tests whether the instruction is a load: the only
  instruction that reads data memory is a load. The next two lines check
  whether the destination register field of the load in the EX stage
  matches either source register of the instruction in the ID stage.
* If the condition holds, the instruction stalls one clock cycle. After
  this 1-cycle stall, the forwarding logic can handle the dependence and
  execution proceeds.
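
As a rough illustration, the stall condition can be written as a one-line predicate. The function name `must_stall` and its parameter names are hypothetical; they simply mirror the pipeline register fields used in the condition.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Load-use hazard check: True when the pipeline must stall one cycle.

    id_ex_mem_read: True if the instruction in EX is a load.
    id_ex_rt:       destination register of that load.
    if_id_rs/rt:    source registers of the instruction in ID.
    """
    return bool(id_ex_mem_read
                and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


# lw $2, 20($1) is in EX; and $4, $2, $5 is in ID and reads $2.
print(must_stall(True, 2, 2, 5))
# An ALU instruction in EX never triggers the load-use stall.
print(must_stall(False, 2, 2, 5))
```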
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Program execution order (in instructions):

    lw  $2, 20($1)
    and becomes nop
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2

[Figure: pipeline timing diagram over clock cycles CC1 to CC10, with a bubble inserted after the load.]



A bubble is inserted beginning in clock cycle 4, by changing the and
instruction to a nop. Note that the and instruction is really fetched
and decoded in clock cycles 2 and 3, but its EX stage is delayed until
clock cycle 5 (versus the unstalled position in clock cycle 4).
Likewise the or instruction is fetched in clock cycle 3, but its ID
stage is delayed until clock cycle 5 (versus the unstalled clock cycle 4
position). After insertion of the bubble, all the dependences go forward
in time and no further hazards occur.
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Pipelined control overview, showing the two multiplexors for forwarding, the hazard
detection unit, and the forwarding unit. Although the ID and EX stages have been
simplified (the sign-extended immediate and branch logic are missing), this drawing
gives the essence of the forwarding hardware requirements.


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Write After Read - WAR Hazard 
* Instruction J should write its result after it is read by I
* Called an anti-dependence by compiler writers

    I: sub $4, $1, $3    # $1 is read
    J: add $1, $2, $3    # $1 is written

* Results from reuse of the name $1
* Hazard occurs when J writes $1 before I reads it
* Cannot occur in our basic 5-stage pipeline because:
  - Reads are always in stage 2, and
  - Writes are always in stage 5
  - Instructions are processed in order

Write After Write - WAW Hazard
* Instruction J should write its result after instruction I
* Called an output-dependence in compiler terminology

    I: sub $1, $4, $3    # $1 is written
    J: add $1, $2, $3    # $1 is written again

* This hazard also results from the reuse of the name $1
* Hazard occurs when writes occur in the wrong order
* Cannot happen in our basic 5-stage pipeline because:
  - All writes are ordered and always take place in stage 5
* WAR and WAW hazards can occur in complex pipelines
* Notice that Read After Read (RAR) is NOT a hazard
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Control Hazards

* Jump and Branch can cause great performance loss
* Jump instruction needs only the jump target address
* Branch instruction needs two things:
  - Branch result: Taken or Not Taken
  - Branch target address:
      PC + 4                    if the branch is NOT taken
      PC + 4 + 4 × Immediate    if the branch is taken
* Jump and Branch targets are computed in the ID stage
  - At which point a new instruction is already being fetched
  - Jump instruction: 1-cycle delay
  - Branch: 2-cycle delay for the branch result (taken or not taken)
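
The two branch target cases can be captured in a short sketch. The function name `next_pc` is hypothetical, and the immediate is assumed to be an already sign-extended word offset.

```python
def next_pc(pc, immediate, branch_taken):
    """Address of the instruction that follows a branch at address pc.

    immediate is assumed to be a sign-extended word offset, so it is
    scaled by 4 to obtain a byte offset.
    """
    sequential = pc + 4
    if branch_taken:
        return sequential + 4 * immediate
    return sequential


# A branch at 0x1000 with immediate 3:
print(hex(next_pc(0x1000, 3, True)))    # taken: PC + 4 + 4 * 3
print(hex(next_pc(0x1000, 3, False)))   # not taken: PC + 4
```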


1-Cycle Jump Delay

* Control logic detects a Jump instruction in the 2nd stage
* The next instruction is fetched anyway
* Convert the next instruction into a bubble (Jump is always taken)

[Figure: pipeline timing diagram over cc1 to cc7 showing the jump, the next instruction converted into a bubble, and the target instruction at L1.]





Dr. Ahmed Jaber Spring 2019 


320 


2-Cycle Branch Delay

* Control logic detects a Branch instruction in the 2nd stage
* ALU computes the Branch outcome in the 3rd stage
* Next1 and Next2 instructions will be fetched anyway
* Convert Next1 and Next2 into bubbles if the branch is taken

[Figure: pipeline timing diagram over cc1 to cc7 for beq $t1, $t2, L1, followed by Next1, Next2, and the branch target instruction at L1.]


Predict Branch NOT Taken
* Branches can be predicted to be NOT taken
* If the branch outcome is NOT taken then
  - Next1 and Next2 instructions can be executed
  - Do not convert Next1 and Next2 into bubbles
  - No wasted cycles

[Figure: pipeline timing diagram for beq $t1, $t2, L1 followed by Next1 and Next2, with no bubbles inserted.]
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Memory Hierarchy and Caches 


Random Access Memory

* Large arrays of storage cells
* Volatile memory
  - Holds the stored data as long as it is powered on
* Random access
  - Access time is practically the same to any data on a RAM chip
* Output Enable (OE) control signal
  - Specifies read operation
* Write Enable (WE) control signal
  - Specifies write operation
* 2^n × m RAM chip: n-bit address and m-bit data

Memory Technology
* Static RAM (SRAM) for cache
  - Requires 6 transistors per bit
  - Requires low power to retain bit
* Dynamic RAM (DRAM) for main memory
  - One transistor + capacitor per bit
  - Must be re-written after being read
  - Must also be periodically refreshed
    . Each row can be refreshed simultaneously
  - Address lines are multiplexed
    . Upper half of address: Row Access Strobe (RAS)
    . Lower half of address: Column Access Strobe (CAS)


Static RAM Storage Cell
* Static RAM (SRAM): fast but expensive RAM
* 6-transistor cell
* Typically used for caches
* Provides fast access time

Dynamic RAM Storage Cell
* Dynamic RAM (DRAM): slow, cheap, and dense memory
* Typical choice for main memory
* Cell implementation:
  - 1-transistor cell (pass transistor)
  - Capacitor (stores bit)
* Bit is stored as a charge on the capacitor
* Must be refreshed periodically

Memory Latency versus Bandwidth

* Memory latency
  - Elapsed time between sending the address and receiving the data
  - Measured in nanoseconds
* Memory bandwidth
  - Rate at which data is transferred between memory and CPU
  - Measured in millions of bytes per second
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Typical Memory Hierarchy 
* Registers are at the top of the hierarchy
  - Typical size < 1 KB
  - Access time < 0.5 ns
* Level 1 Cache (8 - 64 KB)
  - Access time: 1 ns
* L2 Cache (512 KB - 8 MB)
  - Access time: 3 - 10 ns
* Main Memory (4 - 16 GB)
  - Access time: 50 - 100 ns
* Disk Storage (> 200 GB)
  - Access time: 5 - 10 ms

[Figure: memory hierarchy diagram with the labels Microprocessor, Memory Bus, and I/O Bus.]
The Need for Cache Memory

* Widening speed gap between CPU and main memory
  - Processor operation takes less than 1 ns
  - Main memory requires about 100 ns to access
* Each instruction involves at least one memory access
  - One memory access to fetch the instruction
  - A second memory access for load and store instructions
* Cache memory can help bridge the CPU-memory gap
* Cache memory is small in size but fast
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The Locality Principle
Keep the most often-used data in a small, fast SRAM (often local to the
CPU chip).

Refer to main memory only rarely, for the remaining data.

The reason this strategy works: LOCALITY

Locality of reference: access to address X at time t implies that
access to address X + ΔX at time t + Δt becomes more probable as ΔX and
Δt approach zero.





There are two different types of locality:

Temporal locality (locality in time): the principle stating that if a
data location is referenced then it will tend to be referenced again
soon.
* Caches exploit temporal locality by keeping recently accessed data
  closer to the processor

Spatial locality (locality in space): the locality principle stating
that if a data location is referenced, data locations with nearby
addresses will tend to be referenced soon.
* Caches exploit spatial locality by moving blocks consisting of
  multiple contiguous words

We take advantage of the principle of locality by implementing the
memory of a computer as a memory hierarchy. A memory hierarchy consists
of multiple levels of memory with different speeds and sizes. The faster
memories are more expensive per bit than the slower memories and thus
are smaller.

Memory Hierarchy
A structure that uses multiple levels of memories; as the distance from
the processor increases, the size of the memories and the access time
both increase.

Location           | Access time | Capacity | Managed by
On the datapath    | 1 cycle     | 1 KB     | Software/Compiler
                   | 2-4 cycles  | 32 KB    | Hardware
                   | 10 cycles   | 256 KB   | Hardware
On chip            | 40 cycles   | 10 MB    | Hardware
Other chips        | 200 cycles  | 10 GB    | Software/OS
                   | 10-1000     | 100 GB   | Software/OS
Mechanical devices | 10 ms       | 1 TB     | Software/OS


Memory Parameters: 





* Access time: increases with distance from the CPU
* Cost/bit: decreases with distance from the CPU
* Capacity: increases with distance from the CPU


memory? 


— — ? 
SS == (ні 
Speed: Fastest Slowest Fast 
Capacity: Smallest Largest Large 
Cost: Highest Lowest Cheap 
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Memory Hierarchy Levels
* The hierarchy is inclusive: every level is a subset of the level below it

[Figure: levels in the memory hierarchy at increasing distance from the CPU in access time, with the size of the memory shown at each level. Data are transferred between levels in blocks (the unit of data copy).]

* Block: minimum unit of data moved between levels
* If the accessed data is present in the upper level (closer to the CPU):
  - Hit: the requested data is in the upper level
  - Hit time: time to access and deliver the data from the upper level
  - Hit ratio: percentage of accesses for which the data is found in the
    upper level (hits/accesses)
* If the accessed data is absent:
  - Miss: the requested data is not in the upper level, i.e. the block is
    copied from the lower level (farther from the CPU)
  - Miss penalty: time to access and copy the data from the lower level
    to the upper level, then to the CPU
  - Miss ratio: percentage of accesses that are not hits

      miss ratio = 1 - hit ratio
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[Figure: CPU, cache, and main memory in two cases.
  HIT: the CPU requests A; A is found in the cache; the cache provides the CPU with the contents of A.
  MISS: the CPU requests A; A is not found in the cache; the cache sends a request to main memory; the contents are copied to the cache.]

Memory access, resulting in a hit or a miss

A hit occurs if the data required by the processor appears in some block
in the upper level. A miss occurs if this is not the case, and the lower
level must be accessed to copy the block that contains the data
requested by the CPU into the upper level.
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> In the following figure, the cache contains a collection of recent references X1,
X2, ..., Xn-1, and the processor requests a word Xn that is not in the cache.
This request results in a miss, and the word Xn is brought from memory into
the cache.

[Figure: a. the cache before the reference to Xn; b. the cache after the reference to Xn. The reference to Xn causes a miss, so it is fetched from memory.]

* Issues
  - How do we know if a data item is in the cache?
  - If it is, where do we find it?
  - If not, what do we do?
> The simplest way to assign a location in the cache for each word in memory
is to assign the cache location based on the address of the word in memory.

> This cache structure is called direct mapped, since each memory location is
mapped directly to exactly one location in the cache.

> Direct-mapped cache: a cache structure in which each memory location is
mapped to exactly one location in the cache.

> Almost all direct-mapped caches use this mapping to find a block:

    (Block address in main mem.) MOD (Number of blocks in the cache)

> In fact, this equation can be implemented in a very simple way if the number
of blocks in the cache is a power of two, 2^x, since

    (Block address in main mem.) MOD 2^x = x lower-order bits of the block address
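
The equivalence between MOD 2^x and taking the x lower-order bits can be checked with a few lines of Python. This is only a sketch; the function name `cache_index` is hypothetical.

```python
def cache_index(block_address, x):
    """Index of a block in a direct-mapped cache with 2**x blocks.

    Masking with (2**x - 1) keeps only the x lower-order bits, which
    equals block_address mod 2**x for non-negative addresses.
    """
    return block_address & ((1 << x) - 1)


# With 8 = 2**3 blocks, the mask and the modulo agree for every block address.
for block_address in (0, 5, 8, 13, 75):
    print(block_address, cache_index(block_address, 3), block_address % 8)
```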
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Block Placement: Direct Mapped 


* Block: unit of data transfer between cache and memory
* Direct-mapped cache:
  - A block can be placed in exactly one location in the cache

[Figure: in this example, the cache index is the least significant 3 bits of the memory address.]

Drawback: may overwrite some parts of the cache while other parts are empty.

A given memory block can be mapped into one and only one cache line.

[Figure: example mapping of main memory blocks to cache lines.]

Advantage
  No need for an expensive associative search!

Disadvantage
  The miss rate may go up due to a possible increase in mapping conflicts.

> Since each block in the cache can hold the contents of different memory
locations that have the same x least-significant address bits, every block in
the cache is augmented with a tag field. The tag bits uniquely identify
which memory content is stored in a given block of the cache.
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* Location determined by address: direct mapped
  - cache block address = memory block address mod cache size (unique)
  - if cache size = 2^m, cache address = lower m bits of the n-bit memory
    address
  - the remaining upper n-m bits are kept as tag bits at each cache block
  - a valid bit is also needed to recognize a valid entry


Inside a Cache Memory

[Figure: a cache memory holding Cache Block 0 through Cache Block N - 1, each stored with a tag. Tags identify the blocks in the cache.]
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Direct-Mapped Cache

* A memory address is divided into
  - Block address: identifies the block in memory
  - Block offset: to access bytes within a block
* A block address is further divided into
  - Index: used for direct cache access
  - Tag: most-significant bits of the block address

      Index = Block Address mod Cache Blocks

* The tag must also be stored inside the cache
  - For block identification
* A valid bit is also required to indicate
  - Whether a cache block is valid or not

[Figure: the address fields (tag, index, offset) select a cache entry holding a valid bit V, a tag, and the block data; the stored tag is compared with the address tag.]

What if there is no data in a location?
  - Valid bit: 1 = present, 0 = not present; initially 0

* Cache hit: the block is stored inside the cache
  - The index is used to access the cache block
  - The address tag is compared against the stored tag
  - If equal and the cache block is valid, then hit
  - Otherwise: cache miss
* If the number of cache blocks is 2^n
  - n bits are used for the cache index
* If the number of bytes in a block is 2^b
  - b bits are used for the block offset
* If 32 bits are used for an address
  - 32 - n - b bits are used for the tag
* Cache data size = 2^(n+b) bytes
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* Example
  - Consider a direct-mapped cache with 256 blocks
  - Block size = 16 bytes
  - Compute the tag, index, and byte offset of address 0x01FFF8AC
* Solution
  - The 32-bit address is divided into:
    . 4-bit byte offset field, because block size = 2^4 = 16 bytes
    . 8-bit cache index, because there are 2^8 = 256 blocks in the cache
    . 20-bit tag field
  - Byte offset = 0xC = 12 (least significant 4 bits of the address)
  - Cache index = 0x8A = 138 (next lower 8 bits of the address)
  - Tag = 0x01FFF (upper 20 bits of the address)
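
The field extraction in this example can be reproduced with shifts and masks. The sketch below is illustrative only; the function name `split_address` is hypothetical.

```python
def split_address(address, index_bits, offset_bits):
    """Split an address into (tag, index, byte offset) for a direct-mapped cache."""
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


# 256 blocks -> 8 index bits; 16-byte blocks -> 4 offset bits.
tag, index, offset = split_address(0x01FFF8AC, index_bits=8, offset_bits=4)
print(hex(tag), hex(index), hex(offset))
```

The printed fields match the tag, cache index, and byte offset worked out in the solution.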


Example
* Consider a small direct-mapped cache with 32 blocks
  - The cache is initially empty; block size = 16 bytes
  - The following memory addresses (in decimal) are referenced:
    1000, 1004, 1008, 2548, 2552, 2556.
  - Map the addresses to cache blocks and indicate whether each is a hit or a miss

* Solution:
  - 1000 = 0x3E8   cache index = 0x1E   Miss (first access)
  - 1004 = 0x3EC   cache index = 0x1E   Hit
  - 1008 = 0x3F0   cache index = 0x1F   Miss (first access)
  - 2548 = 0x9F4   cache index = 0x1F   Miss (different tag)
  - 2552 = 0x9F8   cache index = 0x1F   Hit
  - 2556 = 0x9FC   cache index = 0x1F   Hit
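
A minimal software model of a direct-mapped cache makes it easy to replay this reference sequence. The class below is a sketch under simple assumptions (it tracks only valid bits and tags, not the data), and the name `DirectMappedCache` is hypothetical.

```python
class DirectMappedCache:
    """Tracks valid bits and tags only; the stored data is not modelled."""

    def __init__(self, num_blocks, block_bytes):
        self.num_blocks = num_blocks
        self.block_bytes = block_bytes
        self.valid = [False] * num_blocks   # every line starts invalid
        self.tags = [0] * num_blocks

    def access(self, address):
        """Return (index, hit) for a byte address and update the cache."""
        block_address = address // self.block_bytes
        index = block_address % self.num_blocks
        tag = block_address // self.num_blocks
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            # On a miss the block is brought in, replacing the old one.
            self.valid[index] = True
            self.tags[index] = tag
        return index, hit


cache = DirectMappedCache(num_blocks=32, block_bytes=16)
for address in [1000, 1004, 1008, 2548, 2552, 2556]:
    index, hit = cache.access(address)
    print(address, hex(address), hex(index), "Hit" if hit else "Miss")
```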
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Example

At power-up, every cache line is invalid (V = 0). Let's consider the following
sequence of memory references (word addresses): 10110₂, 11010₂, 10110₂, 10000₂, 10010₂.

[Table: initial cache state with eight entries (index 000 to 111), all invalid; each data block is 32 bits.]

> For the first memory access, at 10110₂, the 3 LSBs, used to index the cache, are 110.
The corresponding block in the cache is invalid (V = 0), so we have a cache miss.
The block containing the requested word is copied into the cache from the next
level below in the memory hierarchy, the tag bits are set to 10, and the valid bit
is set (as the cache block is now valid), resulting in the following state of the cache.

Index | V | Tag | Data (block = 32 bits)
110   | 1 | 10  | Mem[10110₂]
(all other entries: V = 0)

> The next access is at word address 11010₂. The index bits are 010. The
corresponding block in the cache is invalid again, so we have a cache miss, copy
the appropriate block from main memory, and set the tag bits to 11 and the valid bit
to 1, resulting in the cache state below.

Index | V | Tag | Data (block = 32 bits)
010   | 1 | 11  | Mem[11010₂]
110   | 1 | 10  | Mem[10110₂]
(all other entries: V = 0)
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> The next access is at word address 10110₂. The index bits are 110. The
corresponding block of the cache is valid (V = 1), with tag bits 10, which match
the tag bits of the word address 10110₂. This implies a cache hit, so the cache can
promptly provide the CPU with the requested data, Mem[10110₂].

> The next access is at word address 10000₂. The index bits are 000, which
corresponds to an invalid cache block and thus a miss. Copying the right block
from main memory into the cache and adjusting the tag and valid bit results in the
following state of the cache.

Index | V | Tag | Data (block = 32 bits)
000   | 1 | 10  | Mem[10000₂]
010   | 1 | 11  | Mem[11010₂]
110   | 1 | 10  | Mem[10110₂]
(all other entries: V = 0)

> Lastly, 10010₂ is accessed. The block indexed by 010 is valid; however, the tag
bits of the word address, 10, don't match the tag of the corresponding cache block,
which is 11.

> This implies that the block indexed by 010 in the cache is storing the memory word
at 11010₂ and not the memory word at 10010₂. Therefore, we have a cache miss
and replace this block in the cache with a new block, i.e., the contents of 10010₂ in
main memory. After updating the tag, the cache has been updated as follows.

Index | V | Tag | Data (block = 32 bits)
000   | 1 | 10  | Mem[10000₂]
010   | 1 | 10  | Mem[10010₂]
110   | 1 | 10  | Mem[10110₂]
(all other entries: V = 0)
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Exercise: Below is a list of 32-bit memory address references, given as word addresses:
3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253
For each of these references, identify the binary address, the tag, and the
index given a direct-mapped cache with two-word blocks and a total size
of 8 blocks. Also list if each reference is a hit or miss, assuming the cache is

Solution
The block size is 2 words, so you need 1 offset bit (because 2^1 = 2). You have
8 blocks, so you need 3 index bits to give 8 different row indices (because
2^3 = 8). That leaves you with the remaining 28 bits for the tag.


[Table: for each reference, the word address, binary address, tag, index, offset, and hit/miss.]

Note: shifting the word address right by one bit drops the offset bit and
leaves the block address.
  180 = 10110100₂ -> 1011010₂ = 90; the bit shifted out (the offset) is 0; 90 mod 8 = 2
  43  = 00101011₂ -> 0010101₂ = 21; the bit shifted out (the offset) is 1; 21 mod 8 = 5
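
The per-reference fields of this exercise can be recomputed with a short script. This is a sketch only: the function name `trace` is hypothetical, and it assumes the cache starts out empty.

```python
def trace(word_addresses, index_bits=3, offset_bits=1):
    """Return one row per reference: (address, binary, tag, index, offset, H/M).

    Assumption: the cache is empty before the first reference.
    """
    stored_tags = {}   # index -> tag currently held in that cache block
    rows = []
    for address in word_addresses:
        offset = address & ((1 << offset_bits) - 1)
        index = (address >> offset_bits) & ((1 << index_bits) - 1)
        tag = address >> (offset_bits + index_bits)
        hit = stored_tags.get(index) == tag
        stored_tags[index] = tag
        rows.append((address, format(address, "08b"), tag, index, offset,
                     "H" if hit else "M"))
    return rows


REFERENCES = [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253]
for row in trace(REFERENCES):
    print(*row, sep="\t")
```

Only the low 8 bits of each binary address are printed, since every reference in the list fits in 8 bits.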
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Exercise 5.2.3 


You are asked to optimize a cache design for the given references. There are
three direct-mapped cache designs possible, all with a total of 8 words of data:
C1 has 1-word blocks, C2 has 2-word blocks, and C3 has 4-word blocks. In terms
of miss rate, which cache design is the best? If the miss stall time is 25 cycles,
and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles,
which is the best cache design?

Solution

[Table: for each word address, the binary address, tag, and the hit/miss outcome in Cache 1, Cache 2, and Cache 3.]

Cache 1 miss rate = 12/12 = 100% (all 12 word addresses miss)
Cache 1 total cycles = 12 × 25 + 12 × 2 = 324
Cache 2 miss rate = 10/12 = 83%
Cache 2 total cycles = 10 × 25 + 12 × 3 = 286
Cache 3 miss rate = 11/12 = 92%
Cache 3 total cycles = 11 × 25 + 12 × 5 = 335

Cache 2 provides the best performance.
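
The miss counts and cycle totals can be checked by simulating the three designs. The sketch below assumes the reference list from the previous exercise (3, 180, 43, ...) and an initially empty cache; the function name `count_misses` is hypothetical.

```python
def count_misses(word_addresses, words_per_block, total_words=8):
    """Count misses in a direct-mapped cache holding total_words words of data."""
    num_blocks = total_words // words_per_block
    stored_tags = {}   # index -> tag; an absent index means an invalid block
    misses = 0
    for address in word_addresses:
        block_address = address // words_per_block
        index = block_address % num_blocks
        tag = block_address // num_blocks
        if stored_tags.get(index) != tag:
            misses += 1
            stored_tags[index] = tag
    return misses


# Assumed reference stream: the word addresses of the previous exercise.
REFERENCES = [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253]
MISS_STALL = 25   # cycles per miss

for name, words_per_block, access_time in [("C1", 1, 2), ("C2", 2, 3), ("C3", 4, 5)]:
    misses = count_misses(REFERENCES, words_per_block)
    total_cycles = misses * MISS_STALL + len(REFERENCES) * access_time
    print(name, "misses:", misses, "total cycles:", total_cycles)
```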
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Address mapping for direct-mapped cache

[Figure: the requested address is split into tag, index, and byte offset fields; the index selects one of the N cache entries, each holding a valid bit, a tag, and a data block.]

Bits in a Cache

The total number of bits needed for a cache is a function of the cache size
and the address size, because the cache includes both the storage for the
data and the tags. For the following situation:

* 32-bit addresses
* A direct-mapped cache
* The cache size is 2^n blocks, so n bits are used for the index
* The block size is 2^m words, or 2^(m+2) bytes, so m bits are used for the word
  within the block, and two bits are used for the byte part of the address
* The size of the tag field is 32 - (n + m + 2)
* The total number of bits in a direct-mapped cache is

    2^n × (block size + tag size + valid field size)
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* Example:

Cache with 1024 1-word blocks: the byte offset (the 2 least significant bits) is
ignored and the next 10 bits are used to index into the cache.

[Figure: address showing bit positions 31 to 0, split into a 20-bit tag, a 10-bit index, and a 2-bit byte offset. The index selects one of the cache entries, each with Valid, Tag, and Data fields; the tag comparison produces Hit and the selected entry supplies Data.]

> This cache holds 1024 words or 4 KB. Because the cache has 2^10 (or 1024) words
and a block size of one word, 10 bits are used to index the cache.

> We assume 32-bit addresses in this example. The tag from the cache is compared
against the upper portion of the address to determine whether the entry in the
cache corresponds to the requested address, leaving 32 - 10 - 2 = 20 bits to be
compared against the tag.

> If the tag and the upper 20 bits of the address are equal and the valid bit is on, then
the request hits in the cache, and the word is supplied to the processor. Otherwise,
a miss occurs.




EXAMPLE

How many total bits are required for a direct-mapped cache with 16 KiB of
data and 4-word blocks, assuming a 32-bit address?

ANSWER

We know that 16 KiB is 4096 (2^12) words. With a block size of 4 words (2^2),
there are 1024 (2^10) blocks. Each block has 4 × 32 or 128 bits of data plus a
tag, which is 32 - 10 - 2 - 2 bits, plus a valid bit. Thus, the total cache size is

    2^10 × (4 × 32 + (32 - 10 - 2 - 2) + 1) = 2^10 × 147 = 147 Kibibits

or 18.4 KiB for a 16 KiB cache. For this cache, the total number of bits in the
cache is about 1.15 times as many as needed just for the storage of the data.

* Example:
  How many total bits are required for a direct-mapped cache with 128 KB of
  data and 1-word block size, assuming a 32-bit address?
  - Cache data = 128 KB = 2^17 bytes = 2^15 words = 2^15 blocks
  - Cache entry size = block data bits + tag bits + valid bit
                     = 32 + (32 - 15 - 2) + 1 = 48 bits
  - Therefore, cache size = 2^15 × 48 bits = 2^15 × (1.5 × 32) bits
                          = 1.5 × 2^20 bits = 1.5 Mbits
  - Data bits in cache = 128 KB × 8 = 1 Mbit
  - Total cache size / actual cache data = 1.5
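
The formula 2^n × (block size + tag size + valid field size) can be turned into a small calculator. This is a sketch assuming 4-byte words and power-of-two sizes; the function name `cache_total_bits` is hypothetical.

```python
def cache_total_bits(data_bytes, words_per_block, address_bits=32):
    """Total storage bits (data + tag + valid) of a direct-mapped cache.

    Assumes 4-byte words and that the number of blocks and the words per
    block are both powers of two.
    """
    word_bytes = 4
    num_blocks = data_bytes // (word_bytes * words_per_block)
    n = num_blocks.bit_length() - 1        # index bits: log2(number of blocks)
    m = words_per_block.bit_length() - 1   # word-in-block bits: log2(words per block)
    tag_bits = address_bits - (n + m + 2)  # 2 bits select the byte within a word
    bits_per_entry = words_per_block * 32 + tag_bits + 1   # data + tag + valid
    return num_blocks * bits_per_entry


# 16 KiB of data with 4-word blocks.
total = cache_total_bits(16 * 1024, 4)
print(total, "bits =", total / 1024, "Kibibits =", total / 8 / 1024, "KiB")

# The same calculation for 128 KB of data with 1-word blocks.
print(cache_total_bits(128 * 1024, 1) / 2**20, "Mbits")
```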
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Multi-word Cache Blocks (Direct Mapping)

[Figure: a 32-bit memory address split into a tag, an index into the cache (cache = 2^m blocks), an index into the block (block = 2^n words), and an index into the word (word = 4 bytes).]

* e.g., m = 5, n = 4 (16 words per block, 32 blocks in the cache: the cache
  stores 32 × 16 words)
* 110110001010100110101 11001 1010 01:
  - byte #1 of the 10th word in the 25th block
* All words whose address is prefixed with 110110001010100110101 11001 are
  moved into the 25th block of the cache simultaneously
* Cache with 4K 4-word blocks: the first 2 bits (the byte offset) are ignored,
  the next 2 bits are the block offset, and the next 12 bits are used to index
  into the cache

[Figure: address bits 31-16 form the 16-bit tag, bits 15-4 the 12-bit index, bits 3-2 the block offset, and bits 1-0 the byte offset. The cache has 4K entries, each with a valid bit V, a 16-bit tag, and 128 bits of data.]
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* Example:
  - 64 blocks, 16 bytes/block
  - To what block number does address 1200 map?
      Block address = floor(1200/16) = 75 (the 75th block in memory)
      Block number = 75 modulo 64 = 11 (with direct mapping, it maps to the
      11th block in the cache)
  - This block holds all addresses between 1200 and 1215

[Figure: 32-bit memory address split into a 22-bit tag, a 6-bit index, and a 4-bit offset.]

We first find the memory block number that byte address 1200 belongs to.
The size of a block is 16 bytes, so:

    Byte addresses 0 to 15: block 0
    Byte addresses 16 to 31: block 1
    Byte addresses 32 to 47: block 2, and so on.

Byte address 1200 belongs to block number floor(1200/16) = 75.
For a direct-mapped cache,

    Cache block no. = (Memory block no.) MOD (No. of cache blocks)
                    = 75 MOD 64 = 11.
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Set-Associative Cache

* A set is a group of blocks that can be indexed
* Set index = Block address mod Number of sets in cache
* If there are m blocks in a set (m-way set associative) then
  - m tags are checked in parallel using m comparators
* If 2^n sets exist then the set index consists of n bits
* A direct-mapped cache has one block per set
* A fully-associative cache has one set

Fully Associative Cache
* A block can be placed anywhere in the cache: no indexing
* If m blocks exist then
  - m comparators are needed to match the tag
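
The three placement policies differ only in the number of blocks per set, which the following sketch makes explicit. The function name `candidate_locations` is hypothetical, and numbering the cache blocks of a set contiguously is an assumption made for illustration.

```python
def candidate_locations(block_address, num_blocks, ways):
    """Return (set index, cache blocks where the memory block may be placed).

    ways = 1 models a direct-mapped cache; ways = num_blocks models a
    fully associative cache (a single set, so no real indexing).
    Assumption: the blocks of a set are numbered contiguously.
    """
    num_sets = num_blocks // ways
    set_index = block_address % num_sets
    first_block = set_index * ways
    return set_index, list(range(first_block, first_block + ways))


# Where can memory block 12 go in an 8-block cache?
for label, ways in [("direct mapped", 1), ("2-way set associative", 2),
                    ("fully associative", 8)]:
    print(label, candidate_locations(12, num_blocks=8, ways=ways))
```

The number of candidate locations equals the number of tag comparisons needed on a lookup.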
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Direct Mapped

    Tag | Index | Offset

A cache block can only go in one spot in the cache. It makes a cache
block very easy to find, but it's not very flexible about where to put
the blocks.

2-Way Set Associative

    Tag | Index | Offset

This cache is made up of sets that can fit two blocks each. The index is
now used to find the set, and the tag helps find the block within the
set.

4-Way Set Associative

    Tag | Index | Offset

Each set here fits four blocks, so there are fewer sets. As such, fewer
index bits are needed.

Fully Associative

    Tag | Offset

No index is needed, since a cache block can go anywhere in the cache.
Every tag must be compared when finding a block in the cache, but block
placement is very flexible!
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Direct Mapping 
* Each block from memory can only be put in one location 


* Given n cache blocks,
  MM block i maps to cache block i mod n

[Figure: with n = 4 cache blocks, memory blocks 0 to 8 map to cache blocks 0, 1, 2, 3, 0, 1, 2, 3, 0 (i mod 4).]





K-way Set-Associative Mapping 
- Given s sets, block i of MM maps to set i mod s
- Within the set, the block can be put anywhere
- Let k = number of cache blocks per set = n/s
- k comparisons are required for a search

[Figure: with s = 2 sets, successive memory blocks map alternately to set 0 and set 1.]








Fully Associative Mapping 
* Any block from memory can be put in any cache
block (i.e. no restriction)
  - Implies we have to search everywhere to determine hit or miss

[Figure: memory blocks 0 to 8, each of which may be placed in any of cache blocks 0 to 3.]
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Direct Mapped 
* Each block mapped to exactly 1 cache location 
Cache location = (block address) MOD (# blocks in cache) 


[Figure: cache blocks 0 to 7 above memory blocks 0 to 31.]








Fully Associative 


* Each block mapped to any cache location 


Cache location = any

[Figure: cache blocks 0 to 7 above memory blocks 0 to 31.]
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Set Associative 
* Each block mapped to subset of cache locations 


Set selection = (block address) MOD (# sets in cache) 


2-way set associative = 2 blocks in set
This example: 4 sets

[Figure: cache blocks 0 to 7 grouped into 4 sets, above memory blocks 0 to 31.]





























[Figure: three panels (direct mapped, set associative, fully associative) showing where the tag search takes place in an 8-block cache: a single block, the 2 locations of one set, and all 8 locations.]


The location of a memory block whose address is 12 in a cache with 8 blocks varies for direct-mapped, set-associative, and fully associative placement. In direct-mapped placement, there is only one cache block where memory block 12 can be found, and that block is given by (12 modulo 8) = 4. In a two-way set-associative cache, there would be four sets, and memory block 12 must be in set (12 mod 4) = 0; the memory block could be in either element of the set. In a fully associative placement, the memory block for block address 12 can appear in any of the eight cache blocks.
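As a rough check of the three placements just described, the sketch below computes the set index and the candidate cache blocks for memory block 12. The helper `candidate_blocks` is hypothetical, and it assumes that the blocks of set s occupy consecutive cache positions starting at s x ways, which is only one possible layout.

```python
def candidate_blocks(block_address, num_blocks, ways):
    """Return (set index, cache blocks that may hold the memory block).

    Assumption: set s occupies cache blocks s*ways .. s*ways + ways - 1.
    """
    num_sets = num_blocks // ways
    set_index = block_address % num_sets
    first = set_index * ways
    return set_index, list(range(first, first + ways))


# Memory block 12 in an 8-block cache
print(candidate_blocks(12, 8, 1))   # direct mapped
print(candidate_blocks(12, 8, 2))   # two-way set associative
print(candidate_blocks(12, 8, 8))   # fully associative
```

Direct mapped is treated here as one-way set associative and fully associative as eight-way, so the same formula covers all three cases.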
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Spectrum ОҒ Associativity 


For a cache with 8 entries (8 blocks) and different degrees of associativity:

One-way set associative (direct mapped): 8 blocks (0 to 7), each holding one Tag/Data entry
Two-way set associative: 4 sets (0 to 3), each holding two Tag/Data entries
Four-way set associative: 2 sets (0 to 1), each holding four Tag/Data entries
Eight-way set associative (fully associative): 1 set holding eight Tag/Data entries

Scheme name | Number of sets | Blocks per set
Direct mapped | # of blocks in cache | 1
Set associative | (# of blocks in cache) / Associativity | Associativity (typically 2 to 16)
Fully associative | 1 | # of blocks in cache

Scheme name | Location method | # of comparisons
Direct mapped | Index | 1
Set associative | Index the set; compare the set's tags | Degree of associativity
Fully associative | Compare all blocks' tags | # of blocks
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How Do We Find a Block in The Cache? 


* Our Example: 
- Main memory address space = 32 bits (= 4 GBytes)
- Block size = 4 words = 16 bytes
- Cache capacity = 8 blocks = 128 bytes

32-bit address = block address (28 bits: tag + index) | block offset (4 bits)

index = which set
tag = which data/instruction in block
block offset = which word in block


Finding a Block: Direct-Mapped 
























[Figure: direct-mapped cache with cache capacity = 8 blocks (8 entries); labels: Address, Hit, Data.]
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Finding А Block: 2-Way Set-Associative 


[Figure: 2-way set-associative cache lookup, with 2 elements per set.]
Finding A Block: Fully Associative 


[Figure: fully associative cache lookup; labels: Tag, Address, Hit.]
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Problem 
A processor has a 32-bit memory address space. The memory is broken into
blocks of 32 bytes each. The cache is capable of storing 16 kB.
* How many blocks can the cache store?
* Break the address into tag, set, byte offset for a direct-mapped cache.
* Break the address into tag, set, byte offset for a 4-way set-associative
cache.


Solution
* 16 kB / 32 bytes per block = 512 blocks.
* Direct-mapped: 18-bit tag (rest), 9-bit set address, 5-bit block offset.
* 4-way set-associative: each set has 4 lines, so there are 512 / 4 = 128 sets.
o 20-bit tag (rest)
o 7-bit set address
o 5-bit block offset


Problem 
A processor has a 36-bit memory address space. The memory is broken into
blocks of 64 bytes each. The cache is capable of storing 1 MB.
* How many blocks can the cache store?
* Break the address into tag, set, byte offset for a direct-mapped cache.
* Break the address into tag, set, byte offset for an 8-way set-associative
cache.


Solution
* 1 MB / 64 bytes per block = 2**(20-6) = 16k blocks.
* Direct-mapped: 16-bit tag (rest), 14-bit set address, 6-bit block offset.
* 8-way set-associative: each set has 8 lines, so there are 16k / 8 = 2k sets.
o 19-bit tag (rest)
o 11-bit set address
o 6-bit block offset
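Both solutions follow the same recipe, which can be sketched in Python as below. The function `address_fields` and its parameter names are illustrative, and the sketch assumes that the block size, cache size, and associativity are all powers of two.

```python
def address_fields(address_bits, block_bytes, cache_bytes, ways=1):
    """Return (blocks in cache, tag bits, set-index bits, offset bits).

    Assumes all sizes are powers of two; ways=1 means direct mapped.
    """
    num_blocks = cache_bytes // block_bytes
    num_sets = num_blocks // ways
    offset_bits = (block_bytes - 1).bit_length()  # log2(block_bytes)
    index_bits = (num_sets - 1).bit_length()      # log2(num_sets)
    tag_bits = address_bits - index_bits - offset_bits
    return num_blocks, tag_bits, index_bits, offset_bits


# First problem: 32-bit addresses, 32-byte blocks, 16 kB cache
print(address_fields(32, 32, 16 * 1024))
print(address_fields(32, 32, 16 * 1024, ways=4))
# Second problem: 36-bit addresses, 64-byte blocks, 1 MB cache
print(address_fields(36, 64, 1024 * 1024))
print(address_fields(36, 64, 1024 * 1024, ways=8))
```

Raising the associativity moves bits from the set index to the tag; the offset width depends only on the block size.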
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Example 

* Compare 4-block caches
  - Direct mapped, 2-way set associative, fully associative
  - Block access sequence: 0, 8, 0, 6, 8

* Direct mapped:

Block address | Cache block
0 | 0 (= 0 mod 4)
6 | 2 (= 6 mod 4)
8 | 0 (= 8 mod 4)

Contents of cache blocks after reference:

Address of memory block accessed | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0 | miss | Memory[0] | | |
8 | miss | Memory[8] | | |
0 | miss | Memory[0] | | |
6 | miss | Memory[0] | | Memory[6] |
8 | miss | Memory[8] | | Memory[6] |

* 2-way set associative:

Block address | Cache set
0 | 0 (= 0 mod 2)
6 | 0 (= 6 mod 2)
8 | 0 (= 8 mod 2)

Contents of cache blocks after reference:

Address of memory block accessed | Hit or miss | Set 0 | Set 0 | Set 1 | Set 1
0 | miss | Memory[0] | | |
8 | miss | Memory[0] | Memory[8] | |
0 | hit | Memory[0] | Memory[8] | |
6 | miss | Memory[0] | Memory[6] | |
8 | miss | Memory[8] | Memory[6] | |


Choosing Which Block to Replace 






























Least Recently Used (LRU): A replacement scheme in which the block replaced is the one that has
been unused for the longest time.
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О Fully associative: 


Contents of cache blocks after reference:

Address of memory block accessed | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0 | miss | Memory[0] | | |
8 | miss | Memory[0] | Memory[8] | |
0 | hit | Memory[0] | Memory[8] | |
6 | miss | Memory[0] | Memory[8] | Memory[6] |
8 | hit | Memory[0] | Memory[8] | Memory[6] |
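The three traces for the access sequence 0, 8, 0, 6, 8 can be reproduced with a small simulator. This is a minimal sketch, not a model of real hardware: the function `simulate` is a hypothetical helper, it tracks only block addresses, and it assumes LRU replacement within each set.

```python
from collections import OrderedDict


def simulate(accesses, num_blocks, ways):
    """Return 'hit'/'miss' per access for a cache with LRU replacement."""
    num_sets = num_blocks // ways
    # One OrderedDict per set; key order runs from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    results = []
    for block in accesses:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # mark as most recently used
            results.append("hit")
        else:
            if len(lines) == ways:
                lines.popitem(last=False)   # evict the least recently used block
            lines[block] = True
            results.append("miss")
    return results


sequence = [0, 8, 0, 6, 8]
for name, ways in [("direct mapped", 1), ("2-way", 2), ("fully associative", 4)]:
    outcome = simulate(sequence, 4, ways)
    print(name, outcome, outcome.count("miss"), "misses")
```

For this sequence the miss count drops from 5 (direct mapped) to 4 (2-way) to 3 (fully associative), matching the tables in the example.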
Example 
* Set Associative Cache Organization

[Figure: the 32-bit address (bits 31 to 0) supplies a 22-bit tag and an 8-bit index; the index selects one of 256 sets, each holding four (V, Tag, Data) entries; the outputs are Hit and 32-bit Data.]

4-way set-associative cache with 4 comparators and one 4-to-1 multiplexor:
size of cache is 1K blocks = 256 sets x 4-block set size
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Improving Cache Performance 
How? 


* Reduce Miss Rate
* Reduce Cache Miss Penalty
* Reduce Cache Hit Time





Improving Cache Performance 


* Average Memory Access Time (AMAT)
  AMAT = Hit time + Miss rate x Miss penalty

* Used as a framework for optimizations
* Reduce the hit time
  - Small and simple caches
* Reduce the miss rate
  - Larger cache size, higher associativity, and larger block size
* Reduce the miss penalty
  - Multilevel caches
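To see how each term of the AMAT formula contributes, here is a small Python sketch. The numbers (1-cycle hit time, 5% miss rate, 100-cycle miss penalty) are hypothetical values chosen for illustration and do not come from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical baseline: 1-cycle hit, 5% miss rate, 100-cycle miss penalty
baseline = amat(1, 0.05, 100)
# Halving the miss rate or halving the miss penalty removes the same stall time
lower_miss_rate = amat(1, 0.025, 100)
lower_penalty = amat(1, 0.05, 50)
print(baseline, lower_miss_rate, lower_penalty)
```

Because the miss rate and miss penalty are multiplied together, a given percentage reduction in either one reduces the average stall time by the same amount.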
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Multilevel Caches 


Primary cache attached to CPU
  - Small, but fast
Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
  - If a miss occurs in the primary cache, the second-level cache is accessed
  - If the data is found in the L-2 cache, the miss penalty is the access time of the L-2 cache,
    which is much less than the main memory access time
Main memory services L-2 cache misses
  - If a miss occurs again at L-2, then a main memory access is required and a
    large miss penalty is incurred
Some high-end systems include an L-3 cache

Example: given
  - CPU base CPI = 1, clock rate = 4 GHz
  - Miss rate/instruction = 2%
  - Main memory access time = 100 ns
With just primary cache:
  - Miss penalty = 100 ns / 0.25 ns = 400 cycles
  - Effective CPI = 1 + 0.02 x 400 = 9
Adding L-2 cache
  - Access time = 5 ns
  - Global miss rate to main memory = 0.5%
  - Miss penalty to L-2 cache (with L-2 hit) = 5 ns / 0.25 ns = 20 cycles
  - Effective CPI = Base CPI + primary stalls per instruction + secondary
    stalls per instruction = 1 + 2% x 20 + 0.5% x 400 = 3.4
Performance ratio
  - The machine with the L-2 cache is faster by a factor of 9/3.4 = 2.6
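The CPI figures in this example can be recomputed with the sketch below. The function `effective_cpi` and its parameter names are illustrative, and it uses the same simplified stall model as the example: every primary miss pays the L-2 access time, and misses that go to main memory pay the full memory penalty.

```python
def effective_cpi(base_cpi, clock_ghz, l1_miss_rate, mem_ns,
                  l2_ns=None, global_miss_rate=None):
    """Effective CPI with an optional L-2 cache (simplified stall model)."""
    cycle_ns = 1.0 / clock_ghz                  # 4 GHz gives 0.25 ns per cycle
    mem_penalty = mem_ns / cycle_ns             # main memory penalty in cycles
    if l2_ns is None:
        return base_cpi + l1_miss_rate * mem_penalty
    l2_penalty = l2_ns / cycle_ns               # L-2 access penalty in cycles
    return (base_cpi
            + l1_miss_rate * l2_penalty         # primary stalls per instruction
            + global_miss_rate * mem_penalty)   # secondary stalls per instruction


l1_only = effective_cpi(1, 4, 0.02, 100)
with_l2 = effective_cpi(1, 4, 0.02, 100, l2_ns=5, global_miss_rate=0.005)
print(round(l1_only, 2), round(with_l2, 2), round(l1_only / with_l2, 1))
```

The ratio of the two CPI values gives the speedup from adding the L-2 cache, since the clock rate and instruction count are unchanged.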


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Handling Cache Misses 


» Cache miss: a request for data from the cache that cannot be filled because the
data is not present in the cache.


» The control unit must detect a miss and process it by fetching the requested
data from memory (or, as we shall see, a lower-level cache). The cache sends a miss
signal to stall the processor.


» If the cache reports a hit, the computer continues using the data as if nothing
happened. Consequently, we can use the same basic control that we developed in
the processor chapter (datapath and control, pipelining). The memories in the datapath
are simply replaced by caches.


[Figure: pipelined datapath in which the instruction and data memories are replaced by an I-Cache and a D-Cache. Each cache sends a block address to, and receives an instruction block or data block from, the interface to the L2 cache or main memory. An I-Cache miss or D-Cache miss causes the pipeline to stall.]


» Modifying the control of a processor to handle a hit is trivial; misses, however,
require some extra work.


» For a cache miss, we can stall the entire processor, essentially freezing the
contents of the temporary and programmer-visible registers, while we wait for
memory. In contrast, pipeline stalls, discussed in the last chapter, are more complex
because we must continue executing some instructions while we stall others.
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We can now define the steps to be taken on an instruction cache miss:
1. Send the original PC value (current PC - 4) to the memory.

2. Instruct main memory to perform a read and wait for the memory to complete its
access.

3. Write the cache entry, putting the data from memory in the data portion of the entry,
writing the upper bits of the address (from the ALU) into the tag field, and turning the
valid bit on.

4. Restart the instruction execution at the first step, which will refetch the instruction,
this time finding it in the cache.

The control of the cache on a data access is essentially identical: on a miss, we
simply stall the processor until the memory responds with the data.


Handling Writes 


When the CPU writes to the cache, we may use one of two policies:
1- Write Through (Store Through)


» Write through is a storage method in which data is written into the cache and the
corresponding main memory location at the same time (on every write). The cached data
allows for fast retrieval on demand, while the same data in main memory ensures that
nothing will get lost if a crash, power failure, or other system disruption occurs.


» The other key aspect of writes is what occurs on a write miss. We first fetch the words of
the block from memory. After the block is fetched and placed into the cache, we can
overwrite the word that caused the miss into the cache block. We also write the word to
main memory using the full address.
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> Although write through minimizes the risk of data loss, every write operation must be 
done twice, and this redundancy takes time (very simple, but it does not provide very good
performance).
* Solution: performance is improved with a write buffer.
  - Write buffer: a queue that holds data while the data is waiting to be written to
    memory.
  - The CPU continues immediately.
  - The CPU only stalls on a write if the write buffer is already full.


» Write through is the preferred method of data storage in applications where data loss
cannot be tolerated, such as banking and medical device control. In less critical
applications, and especially when data volume is large, an alternative method called write
back is used.


2- Write Back:
» Write back is a storage method in which data is written into the cache every time a change
occurs, but is written into the corresponding location in main memory only when the block
needs to be replaced or flushed.


» Write back optimizes the system speed because it takes less time to write data into the cache
alone than to write the same data into both cache and main memory (write
through). Write back is more efficient than write through, but more complex to
implement.
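The difference in main-memory traffic between the two policies can be illustrated with a toy model. Everything in the sketch is an assumption made for illustration: the cache holds a single block, only stores are modeled, and the function name `memory_writes` and the policy strings are hypothetical.

```python
def memory_writes(store_blocks, policy):
    """Count main-memory writes for a one-block cache (toy model).

    store_blocks: block address touched by each store, in program order.
    policy: "write-through" or "write-back".
    """
    cached = None    # block currently held in the single cache line
    dirty = False    # write-back only: has the cached block been modified?
    writes = 0
    for block in store_blocks:
        if block != cached:
            if policy == "write-back" and dirty:
                writes += 1                  # write the modified block back on replacement
            cached, dirty = block, False     # bring the new block into the cache
        if policy == "write-through":
            writes += 1                      # every store also goes to main memory
        else:
            dirty = True                     # memory is updated later, on replacement
    if policy == "write-back" and dirty:
        writes += 1                          # final flush of the last dirty block
    return writes


stores = [5, 5, 5, 5, 9, 9]
print(memory_writes(stores, "write-through"), memory_writes(stores, "write-back"))
```

In this model write through performs one memory write per store, while write back performs one per modified block, which is why repeated stores to the same block favor write back.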
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Improving Cache Performance 
* Increasing memory bandwidth: use DRAMs for main memory with
  fixed width (e.g., 1 word).
Example: assuming a cache block of 4 words
- 1 clock cycle for address transfer (1 bus trip)
- 15 clock cycles for each memory data access
- 1 clock cycle per data transfer (1 bus trip)

a. One-word-wide memory organization (4-word block, 1-word-wide memory)
Miss penalty = 1 + 4x15 + 4x1 = 65 bus cycles
Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle

b. 4-word-wide memory and bus
Miss penalty = 1 + 1x15 + 1x1 = 17 bus cycles
Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle

c. Interleaved memory organization (4-bank interleaved memory)
Miss penalty = 1 + 1x15 + 4x1 = 20 bus cycles
Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
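The three miss penalties share one formula: address cycles, plus memory accesses that cannot overlap, plus bus transfers. The sketch below is one way to write that out; the function names `miss_penalty` and `bandwidth` are illustrative, and the default cycle counts are the ones assumed in the example.

```python
def miss_penalty(memory_accesses, bus_transfers,
                 address_cycles=1, access_cycles=15, transfer_cycles=1):
    """Bus cycles to fetch one block: address + serialized accesses + transfers."""
    return (address_cycles
            + memory_accesses * access_cycles
            + bus_transfers * transfer_cycles)


def bandwidth(block_bytes, penalty_cycles):
    """Bytes delivered per bus cycle during a miss."""
    return block_bytes / penalty_cycles


BLOCK_BYTES = 16  # 4 words of 4 bytes
organizations = {
    "one-word-wide": miss_penalty(memory_accesses=4, bus_transfers=4),
    "4-word-wide": miss_penalty(memory_accesses=1, bus_transfers=1),
    "4-bank interleaved": miss_penalty(memory_accesses=1, bus_transfers=4),
}
for name, cycles in organizations.items():
    print(name, cycles, round(bandwidth(BLOCK_BYTES, cycles), 2))
```

Interleaving overlaps the four memory accesses but still sends one word per bus trip, so it lands between the narrow and the wide organization.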
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Components of CPU time 
* Program execution cycles
  - Includes cache hit time
* Memory stall cycles
  - Mainly from cache misses

With a simplifying assumption (equal read and write miss penalties):
CPU time = (execution cycles + memory stall cycles) x cycle time
memory stall cycles = memory accesses x miss rate x miss penalty
                    = instructions/program x misses/instruction x miss penalty

Therefore, there are two ways to improve cache performance:
* decrease miss rate
* decrease miss penalty

Example: assuming
* I-cache miss rate = 2%
* D-cache miss rate = 4%
* Miss penalty = 100 cycles
* Base CPI (without memory stalls) = 2
* Loads & stores are 36% of instructions

Miss cycles per instruction:
* I-cache: 0.02 x 100 = 2
* D-cache: 0.36 x 0.04 x 100 = 1.44

Actual CPI (including memory stalls) = 2 + 2 + 1.44 = 5.44

The ideal CPU (no memory stalls) is 5.44/2 = 2.72 times faster
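The CPI calculation in this example can be written as a short Python sketch. The function `cpi_with_stalls` and its parameter names are illustrative; the model assumes that every instruction is fetched through the I-cache and that only loads and stores access the D-cache.

```python
def cpi_with_stalls(base_cpi, icache_miss_rate, dcache_miss_rate,
                    mem_instr_fraction, miss_penalty):
    """Actual CPI = base CPI + I-cache and D-cache stall cycles per instruction."""
    icache_stalls = icache_miss_rate * miss_penalty
    dcache_stalls = mem_instr_fraction * dcache_miss_rate * miss_penalty
    return base_cpi + icache_stalls + dcache_stalls


actual = cpi_with_stalls(base_cpi=2, icache_miss_rate=0.02,
                         dcache_miss_rate=0.04, mem_instr_fraction=0.36,
                         miss_penalty=100)
print(round(actual, 2), round(actual / 2, 2))
```

Dividing the actual CPI by the base CPI gives the speedup of an ideal processor with no memory stalls.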
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